MODEL SELECTION AND RANDOMIZATION FOR WEAKLY DEPENDENT TIME SERIES FORECASTING



PIERRE ALQUIER (1) AND OLIVIER WINTENBERGER (2) 

Abstract. Observing a stationary time series, we propose in this paper new two-step
procedures for the prediction of the next value of the time series. Following the machine learning
theory paradigm, the first step consists in determining randomized estimators, or "experts", in
(possibly numerous) different predictive models. In the second step, estimators are obtained by
model selection or randomization associated with exponential weights of these experts. We prove
Oracle inequalities for both estimators and provide some applications for linear predictors,
artificial Neural Networks and additive non-parametric predictors.



1. Introduction 

When observing a time series, one crucial issue is to predict the first future value from the observed
past values. Since the seminal works of Akaike, see for example [1], different model selection
procedures have been studied for inferring how many observed past values are needed for predicting
the next value. The efficiency of different penalized empirical risk minimizers such as AIC, BIC,
Mallows and APE predictors has been proved when the observations satisfy a linear auto-regressive
model, see for instance Ing [17]. The main issue in this context is to determine the order of an
efficient predictive linear autoregressive model and then to estimate its coefficients. There, the
model satisfied by the observations is assumed to belong to the same class as the predictive
models.

At the same time, model selection procedures have been greatly improved using learning theory
in the independent and identically distributed (iid for short) case, see Vapnik [28] and Massart [21]
among others. Results such as Oracle inequalities have been established in very general contexts.
Even if the true model does not belong to one of the models proposed by the experts, recent
procedures ensure that the risk is as small as possible. However, few works have dealt with
dependent observations, principally in two directions: penalized least squares and randomization
techniques. Baraud et al. [5] proved Oracle inequalities with respect to the quadratic loss and
under a $\beta$-mixing condition. Their penalized empirical risk minimizers select an efficient predictive
model when the number of useful past values is known. Recently, the theory of individual
sequences has also led to Oracle inequalities for the risk of prediction. Randomization with exponential
weights of expert advice predicts the observations as if they formed a deterministic sequence. We refer
the reader to Lugosi and Cesa-Bianchi [20] for more details. Good predictors are then obtained
given the expert advice. But the form of the expert advice given the observations is not specified,
and hence the form of the predictors is not tractable.

In this paper, we give Oracle inequalities for the $L^1$-risk of prediction of some stationary time
series. We introduce two new procedures that find an efficient predictive model associated with
an efficient number of past values. To prove this we use the PAC-Bayesian approach introduced
by McAllester [22]. This general theoretical framework has proved to efficiently give Oracle in- 
equalities in many iid frameworks, see Catoni [T], El E] , Audibert [4] and Alquier [2]. There exist 
procedures and Oracle inequalities in the dependent case, see Baraud et al. [5] and Modha and
Masry [24]. In Modha and Masry [24], the procedure uses the $\alpha$-mixing coefficients of the
observations. To our knowledge, there is no efficient estimator of these coefficients, so their procedure
is not implementable in practice. In Baraud et al. [5], the Oracle inequality holds only if the
$\beta$-mixing coefficients and the prediction procedure together satisfy intricate conditions. Here again,
as $\beta$-mixing coefficients are not estimable, there is no way to check those conditions. In this paper,
the prediction procedures are for the first time completely free of the dependence properties of the
observations. This represents important progress for learning theory applications with dependent
observations. Let us mention that for the estimation of the transition density of Harris recurrent
Markov chains, Lacour [19] also gives a procedure completely free of the dependence properties.

Let us assume that we observe $(X_1,\dots,X_n)$ from a stationary time series $X = (X_t)_{t\in\mathbb Z}$
distributed as $\pi_0$ on $\mathcal X^{\mathbb Z}$, where $\mathcal X$ is a Hilbert space equipped with its usual norm $\|\cdot\|$. For each $\theta$
in the set of parameters $\Theta$ we associate a $p(\theta)$-autoregressive function $f_\theta$ from $\mathcal X^{p(\theta)}$ to $\mathcal X$ that
represents a predictive model. Then each $\theta\in\Theta$ is associated with a predictor $\hat X^\theta_{n+1} = f_\theta(X_n,\dots,X_{n+1-p(\theta)})$.
The risk of prediction is the absolute loss $R(\theta)$ defined as:

$$R(\theta) = \pi_0\left[\big\|f_\theta(X_{p(\theta)},\dots,X_1) - X_{p(\theta)+1}\big\|\right],$$

where here and all along the paper $\pi[h] = \int h\,d\pi$ for any measure $\pi$ and any integrable function $h$.
The choice of this risk instead of the classic quadratic loss is due to its Lipschitz property, which is very
well suited to the dependence context here. The main objective of this paper is to determine
two different procedures that give estimators $\hat\theta_n$ with associated risk $R(\hat\theta_n)$ satisfying an Oracle
inequality; in other words, $R(\hat\theta_n)$ is not far from $\inf_\Theta R$.

As we have to deal with different models and different delays at the same time, it is convenient
to split the set $\Theta$ in subsets of the form:

$$\Theta = \bigcup_{p=1}^{\lfloor n/2\rfloor}\Theta_p \quad\text{with}\quad \Theta_p = \bigcup_{\ell=1}^{m_p}\Theta_{p,\ell},$$

where $m_p > 0$ has to be fixed carefully. The set $\Theta_p$ consists of different predictive models that
need the same number of past values. To fix ideas, let us give the simple example of additive
non-parametric predictive models when $\mathcal X = \mathbb R$, see Subsection 4.3 for more details. Let us define

$$\hat X_{n+1} = \sum_{i=0}^{p} f_i(X_{n-i}).$$

Then we fix $\theta = (f_i)_{0\le i\le p}$. We split

$$\Theta = \bigcup_{p=1}^{\lfloor n/2\rfloor}\Theta_p = \bigcup_{p=1}^{\lfloor n/2\rfloor}\left\{(f_i)_{0\le i\le p}\in A_p\right\}$$

where $C$ is a compact subset of $\mathbb R$ and $A_p$ is a compact subset of $\mathcal F^{p+1}$ for $\mathcal F$ the set of integrable
functions from $\mathbb R$ to $\mathbb R$. Under suitable conditions on $\mathcal F$, there exists an ordered functional basis
$(\varphi_j)_{j\ge 1}$. Then the index $\ell$ corresponds to the number of the first functions of the basis that we
consider. Then $f_i = \sum_{j=1}^{\ell} a_{i,j}\varphi_j$ for each $i$ and $\Theta_{p,\ell} = \{(a_{i,j})_{0\le i\le p,\ 1\le j\le\ell}\}$.

The common first step of our two prediction procedures consists in proposing a randomized
estimator $\hat\theta_{p,\ell}$ for each subset $\Theta_{p,\ell}$. Then we propose two different estimators $\hat\theta$ and $\tilde\theta$ of a parameter
$\theta$ associated with an efficient predictive model. The first procedure is a model selection that
provides $(\hat p,\hat\ell)$. It leads to the natural choice $\hat\theta = \hat\theta_{\hat p,\hat\ell}$. Our model selection criterion for each
pair of indices $(p,\ell)$ is close to the following penalized empirical risk criterion

$$r_n(\hat\theta_{p,\ell}) + K_n\sqrt{\frac{d_{p,\ell}}{n-p}}\,\ln(d_{p,\ell}\,n),$$

where $r_n(\theta)$ is the empirical risk, $d_{p,\ell}$ is a measure of the complexity of $\Theta_{p,\ell}$, highly related to its
dimension, and $K_n > 0$ is independent of $p$, $\ell$. The second procedure is a second randomization
step on the indexes $(p,\ell)$ that gives $(\tilde p,\tilde\ell)$ and then leads to the corresponding estimator $\tilde\theta = \hat\theta_{\tilde p,\tilde\ell}$.
The exponential weights associated with each pair of indices $(p,\ell)$ have the same form as the ones used
for randomizing expert advice in the theory of individual sequences. They depend strongly on a
parameter $K_n > 0$.

The value of $K_n$ has to be fixed arbitrarily and it has many consequences on the sharpness of
the Oracle inequalities we obtain. For bounded observations, the best choice is to fix it larger than
some constant depending on the (non-estimable) dependence properties of the observations. If we
fail to do so, remark that a weaker Oracle inequality still holds, see the results in Section 3. For possibly
unbounded observations, we can fix it proportional to $\ln(n)$, independently of the observations.
Such a choice leads to an additional logarithmic term in the rate of convergence. But remark that
even for $K_n$ fixed as a constant, as we over-penalize the expected risk, there are always additional
logarithmic terms in the rates of the Oracle inequalities, see below. So we can fix as a rule of
thumb $K_n = C\ln(n)$ for some known $C$, and our procedure is then free of the dependence properties
of the observations, see Subsection 3.3.3 for more details.



Let us summarize the main results of this paper for $K_n$ fixed to $\ln(n)$. For bounded observations, we
prove a Probably Approximately Correct Oracle inequality: for $n$ large enough, with probability
at least $1-\varepsilon$

$$R(\hat\theta_n) \le \min_{p,\ell}\left\{\inf_{\theta\in\Theta_{p,\ell}} R(\theta) + C\sqrt{\frac{d_{p,\ell}}{n-p}}\,\ln(d_{p,\ell}/\varepsilon)\,\ln^2(n)\right\}$$

where $C$ is a constant. For possibly unbounded observations, we obtain Oracle inequalities in
expectation. More precisely, we obtain that for $n$ sufficiently large

$$\pi_0[R(\hat\theta_n)] \le \min_{p,\ell}\left\{\inf_{\theta\in\Theta_{p,\ell}} R(\theta) + C\sqrt{\frac{d_{p,\ell}}{n-p}}\,\ln(d_{p,\ell})\,\ln^2(n)\right\}$$

where $C$ is a constant. This result can be compared with those of Baraud, Comte and Viennet [5]
and Modha and Masry [24]. They achieve respectively Oracle inequalities of the form

M(R'(9 P ,n)] 

MR' n )} 

where $R'$ is the excess quadratic risk, $0 < c < 1$ is a constant depending on the dependence
structure of the observations and $C$ is fixed by the statistician. Our Oracle inequalities are sharper
than the ones of [24]. Baraud et al. [5] achieve the optimal rates, which we do not, but at the price of a loss in
the constant. Moreover, as already noticed, those authors are not fully adaptive in $p$.

To obtain such Oracle inequalities, sharp exponential inequalities are used in the dependent
setting. For this, weak dependence properties of the observations are assumed. This dependent
setting may be more general than the mixing one, see the monograph of Dedecker et al. [10].
Here we use in the bounded case the $\theta$-coefficients (also called $\gamma$-mixing coefficients) introduced
in Rio [25] to derive a sharp Hoeffding inequality in the dependent framework. These coefficients
generalize the uniform mixing ones. In the unbounded case we use generic models called chains
with infinite memory, introduced by Doukhan and Wintenberger [13], that include many classical
econometric models such as ARMA, GARCH and LARCH. Here we work under restrictions
of additive form that unfortunately exclude unbounded volatility models, see Subsection 2.4 for
more details. Our dependent framework is not comparable with the $\beta$- or $\alpha$-mixing one as it deals
with some dynamical systems that are not mixing, see Andrews [3] and Dedecker and Prieur [11]
for details on these counter-examples.

The paper is organized as follows. First some notation, the framework and the predictors are
introduced in Section 2. Then the Oracle inequalities and some comments follow in Section 3. The
main results of this section are applied to linear predictors, artificial Neural Networks predictors
and non-parametric auto-regressive predictors in Section 4. Finally the proofs are collected in
Section 5.

2. Preliminaries 

Let $X = (X_t)_{t\in\mathbb Z}$ be a stationary process taking values in a measurable Hilbert space $(\mathcal X,\mathcal B)$
(with norm $\|\cdot\|$ and scalar product $\langle\cdot,\cdot\rangle$). Assume that $X$ is distributed as $\pi_0$.

2.1. The predictive models. For any $p\in\{1,\dots,n-1\}$ and any $\ell\in\{1,\dots,m_p\}$ with $m_p > 0$,
any parameter $\theta$ of the set $\Theta_{p,\ell}$, a compact subset of $\mathbb R^q$ for some $q < \infty$, is identified with a function



< (1 + mini inf R'(9)+C 3 -^} for each p, 

< (1 + ~) mm {inf R>(6) + C (™eA ° In(^) ) 
$f_\theta : \mathcal X^p \to \mathcal X$. Let us assume that there exists a sequence $(a_j(\theta))_{j\in\{1,\dots,p\}}$ satisfying the relation

(2.1) $\sum_{j=1}^{p} a_j(\theta) \le L,$

such that for any $(x_1,\dots,x_p)$, $(y_1,\dots,y_p)\in\mathcal X^p$ we have:

(2.2) $\|f_\theta(x_1,\dots,x_p) - f_\theta(y_1,\dots,y_p)\| \le \sum_{j=1}^{p} a_j(\theta)\,\|x_j - y_j\|.$

Moreover, we assume that the $\Theta_p$ are disjoint sets for all $p\in\{1,\dots,\lfloor n/2\rfloor\}$ so that any $\theta\in\Theta$
belongs to one and only one $\Theta_p$. We write $p(\theta)$ for the corresponding value of $p$. Let us define the
set of indexes:

$$\mathcal M = \bigcup_{p=1}^{\lfloor n/2\rfloor}\{p\}\times\{1,\dots,m_p\}.$$

Finally, $\mathcal T$ denotes a $\sigma$-algebra on $\Theta$, and for any $(p,\ell)\in\mathcal M$, $\mathcal T_{p,\ell}$ denotes the restriction of $\mathcal T$ to $\Theta_{p,\ell}$.
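2.2. The risk. For a $\theta\in\Theta$ chosen from the observations, we check the ability of
$\hat X^\theta_N = f_\theta(X_{N-1},\dots,X_{N-p(\theta)})$ to predict $X_N$ for any $N\in\mathbb Z$. The error of prediction $R(\theta)$ is the
expectation of the absolute loss incurred when predicting $X_N$ by $\hat X^\theta_N$, conditional on the value of $\theta$;
by stationarity it does not depend on $N$:

$$R(\theta) = \pi_0\left[\big\|\hat X^\theta_1 - X_1\big\|\right].$$

The objective is to determine $\hat\theta_n$ such that its risk is close to $R(\overline\theta)$ where $\overline\theta\in\arg\min_{\theta\in\Theta} R(\theta)$.
We define also the values $\overline\theta_{p,\ell}$ for any $(p,\ell)\in\mathcal M$ by $\overline\theta_{p,\ell}\in\arg\min_{\theta\in\Theta_{p,\ell}} R(\theta)$.

The risk $R(\theta)$ cannot be computed as the distribution $\pi_0$ is unknown. So we introduce its
empirical counterpart $r_n(\theta)$ as

$$r_n(\theta) = \frac{1}{n-p(\theta)}\sum_{t=p(\theta)+1}^{n}\big\|\hat X^\theta_t - X_t\big\|.$$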
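As a small illustration of the empirical risk $r_n(\theta)$ defined above, the following sketch computes it for a real-valued series. The helper names `empirical_risk` and `linear_predictor` are hypothetical and chosen for this example only; the linear form of the predictor is one possible choice of $f_\theta$, not the only one covered by the definition.

```python
from typing import Callable, List, Sequence


def empirical_risk(predict: Callable[[List[float]], float], p: int,
                   xs: Sequence[float]) -> float:
    """Average absolute prediction error over t = p+1, ..., n.

    `predict` receives the p past values, most recent first,
    i.e. (X_{t-1}, ..., X_{t-p}), and returns the prediction of X_t.
    """
    n = len(xs)
    if not 0 < p < n:
        raise ValueError("need 0 < p < len(xs)")
    total = 0.0
    for t in range(p, n):  # 0-based index t stands for X_{t+1}
        past = [xs[t - j] for j in range(1, p + 1)]
        total += abs(predict(past) - xs[t])
    return total / (n - p)


def linear_predictor(theta: Sequence[float]) -> Callable[[List[float]], float]:
    """Illustrative predictor: theta[0] + sum_j theta[j] * X_{t-j}."""
    def predict(past: List[float]) -> float:
        return theta[0] + sum(c * x for c, x in zip(theta[1:], past))
    return predict
```

For instance, on the series 1, 2, 3, 4, 5 the predictor "previous value plus one" has zero empirical risk, while the predictor "previous value" has empirical risk one.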



2.3. The estimators. For any model $(p,\ell)\in\mathcal M$ let us choose a probability measure $\pi_{p,\ell}$ on the
measurable space $(\Theta_{p,\ell},\mathcal T_{p,\ell})$. A usual choice for $\pi_{p,\ell}$ is the Lebesgue measure on $\Theta_{p,\ell}$, which is
often a compact subset of $\mathbb R^d$ for some $d > 0$; note that the choice of the various parameters
involved in this subsection is discussed later in the paper and illustrated by a simple example.
Let us also choose some prior weights on the models: $w_{p,\ell} > 0$ such that $\sum_{(p,\ell)\in\mathcal M} w_{p,\ell} \le 1$. This
choice will be discussed later.



For any measure $\pi$ and any measurable function $h$ such that $\pi[\exp(h)] < +\infty$, we define the
Gibbs measure $\pi\{h\}$ through the equation:

(2.3) $\dfrac{d\pi\{h\}}{d\pi}(\theta) = \dfrac{\exp(h(\theta))}{\pi[\exp(h)]}.$
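To make definition (2.3) concrete, here is a minimal sketch for the special case of a measure supported on finitely many parameter values, which is an assumption made purely for illustration (the text works with general measures). The function name `gibbs_weights` is hypothetical. The maximum of $h$ is subtracted before exponentiating, which leaves the normalized weights unchanged and avoids overflow.

```python
import math
from typing import List, Sequence


def gibbs_weights(prior: Sequence[float], h: Sequence[float]) -> List[float]:
    """Weights of the Gibbs measure pi{h} on a finite support.

    prior[i] is the prior mass of the i-th point and h[i] the value of h
    there; the result is prior[i] * exp(h[i]) / sum_j prior[j] * exp(h[j]).
    """
    if len(prior) != len(h) or not prior:
        raise ValueError("prior and h must be non-empty and of equal length")
    shift = max(h)  # shifting h does not change the normalized weights
    unnormalized = [w * math.exp(v - shift) for w, v in zip(prior, h)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

With $h = -\lambda r_n$, points with a smaller empirical risk receive exponentially more mass as $\lambda$ grows.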
Let us put

$$\mathcal G = \left\{8, 16, \dots, 2^{\lfloor \ln(n^2)/\ln 2\rfloor}\right\}.$$

Now let us choose some $K_n > 1$, see Remark 3.3.3 for a discussion of this choice. For any
$(p,\ell)\in\mathcal M$ and $\lambda > 0$ we define:

R(pJ,X) = -Un [ exp (-Ar n (0)) ^(0) + j In M + . 2 ■ 

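Now, we propose, for any model $(p,\ell)\in\mathcal M$ and parameter $\lambda\in\mathcal G$, the following estimation
procedure: draw

$$\hat\theta^{\lambda}_{p,\ell} \sim \pi_{p,\ell}\{-\lambda r_n\}.$$

Then, we propose two procedures to select a model $(p,\ell)\in\mathcal M$ and a parameter $\lambda\in\mathcal G$.
The first procedure is a model selection procedure: we choose

$$(\hat p,\hat\ell,\hat\lambda) = \arg\min_{(p,\ell)\in\mathcal M,\ \lambda\in\mathcal G} R(p,\ell,\lambda)$$

and we have the estimator

$$\hat\theta = \hat\theta^{\hat\lambda}_{\hat p,\hat\ell}.$$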

The second procedure proceeds by randomization on all the models. For any $\lambda\in\mathcal G$, we define
the weights

exp (—R(p,£,\) 

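and we draw $(p^\lambda,\ell^\lambda)$ randomly according to the weights $(W_{p,\ell})_{(p,\ell)\in\mathcal M}$ and finally choose

$$\tilde\lambda = \arg\min_{\lambda\in\mathcal G} R(p^\lambda,\ell^\lambda,\lambda)$$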
and we have the estimator 
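The two selection steps of this subsection can be contrasted on a toy example. The sketch below assumes that the criterion values $R(p,\ell,\lambda)$ have already been computed for a fixed $\lambda$ and are stored in a dictionary keyed by $(p,\ell)$; the numerical values used in any call are hypothetical. It also assumes weights proportional to $\exp(-R(p,\ell,\lambda))$; the exact normalization used in the paper is not reproduced here. The function names are illustrative only.

```python
import math
import random
from typing import Dict, Tuple

Model = Tuple[int, int]  # a pair (p, l)


def select_model(criterion: Dict[Model, float]) -> Model:
    """Model selection: return the pair (p, l) minimizing the criterion."""
    return min(criterion, key=criterion.get)


def draw_model(criterion: Dict[Model, float], rng: random.Random) -> Model:
    """Randomization: draw (p, l) with probability proportional to
    exp(-criterion), an assumed form of the exponential weights."""
    models = list(criterion)
    best = min(criterion.values())  # shift for numerical stability
    weights = [math.exp(-(criterion[m] - best)) for m in models]
    return rng.choices(models, weights=weights, k=1)[0]
```

The first function always returns the same model for given criterion values, whereas the second one returns a random model that concentrates on the minimizers when the criterion gaps are large.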

2.4. Assumptions on the observations. In order to achieve our main results, we need to make
some assumptions on the observations. We give below two very different settings. One is
based on a specific (but wide) unbounded model, the so-called chain with infinite memory (or complete
connection). The other one concerns the bounded case under a condition of weak
dependence type. See Section 4 for some examples.



2.4.1. Chains with infinite memory. We study chains with infinite memory introduced in [14]
and we refer to this setting as Assumption (CIM). Let $\xi_t$ for $t\in\mathbb Z$ be independent and identically
distributed variables, distributed as $\mu$ on a Banach space $\mathcal X'$, called the innovations. We assume that
the norm of the innovations admits a Laplace transform, more formally that for all $c\in\mathbb R$, we have
$\mu[\exp(c\|\xi_0\|)] < \infty$. We write this Laplace transform $\Psi(c) := \mu[\exp(c\|\xi_0\|)]$. We will say that
(CIM) is satisfied if $X = (X_t)_{t\in\mathbb Z}$ is the solution of the equation

(2.4) $X_t = F(X_{t-1}, X_{t-2}, \dots; \xi_t)$ almost everywhere,

for some function $F : \mathcal X^{\mathbb N\setminus\{0\}}\times\mathcal X' \to \mathcal X$. Assume also that there exists some $u$ satisfying, for all
$x = (x_k)_{k\in\mathbb N\setminus\{0\}}$, $x' = (x'_k)_{k\in\mathbb N\setminus\{0\}}\in\mathcal X^{\mathbb N\setminus\{0\}}$ such that there exists $N > 0$ with $x_k = x'_k = 0$ for all
$k > N$, and all $y, y'\in\mathcal X'$, the condition

(2.5) $\|F(x;y) - F(x';y')\| \le \sum_{j=1}^{\infty} a_j(F)\,\|x_j - x'_j\| + u\,\|y - y'\|$

(2.6) with $\sum_{j=1}^{\infty} a_j(F) := a(F) < 1$.



Using directly Theorem 3.1 of [14] we derive the following proposition:

Proposition 2.1. There exists a unique stationary causal solution $X$ of equation (2.4), satisfying
$\pi_0[\|X_0\|^r] < \infty$ for any $1 \le r < \infty$.
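A simple instance of Assumption (CIM) is a real-valued autoregression of finite order with Gaussian innovations: taking $F(x;y) = \sum_j a_j x_j + y$, condition (2.5) holds with $a_j(F) = |a_j|$ and $u = 1$, and (2.6) reduces to $\sum_j |a_j| < 1$. The sketch below simulates such a chain; the function names, the burn-in length and the seed are illustrative assumptions, and a finite burn-in only approximates the stationary solution.

```python
import random
from typing import List, Sequence


def contraction_constant(coeffs: Sequence[float]) -> float:
    """a(F) = sum_j |a_j| for the linear map F(x; y) = sum_j a_j x_j + y."""
    return sum(abs(a) for a in coeffs)


def simulate_chain(coeffs: Sequence[float], n: int,
                   burn_in: int = 500, seed: int = 0) -> List[float]:
    """Simulate n values of X_t = sum_j a_j X_{t-j} + xi_t with N(0, 1) innovations."""
    if n <= 0 or not coeffs:
        raise ValueError("need n > 0 and at least one coefficient")
    if contraction_constant(coeffs) >= 1.0:
        raise ValueError("condition (2.6) fails: sum of |a_j| must be < 1")
    rng = random.Random(seed)
    m = len(coeffs)
    path = [0.0] * m  # arbitrary initial values, forgotten after burn-in
    for _ in range(burn_in + n):
        recent = path[-1:-m - 1:-1]  # X_{t-1}, ..., X_{t-m}
        drift = sum(a * x for a, x in zip(coeffs, recent))
        path.append(drift + rng.gauss(0.0, 1.0))
    return path[-n:]
```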

2.4.2. Bounded weakly dependent processes. We refer to this case, described below, as Assumption
(WDP). In all this subsection we assume that $X$ is bounded, i.e. $\|X\|_\infty < \infty$. In our context
the appropriate weak dependence notion relies on the coefficients $\theta_{\infty,n}(1)$ introduced by
Dedecker et al. [10]. This is a version of the $\gamma$-mixing of Rio [26] adapted to stationary time
series. If $Z$ is a bounded random variable on $(\Omega,\mathcal A,\mathbb P)$, for any $\sigma$-algebra $\mathfrak S$ of $\mathcal A$ we put:

$$\theta_\infty(\mathfrak S, Z) = \sup_{f\in\Lambda_1}\big\|\mathbb E(f(Z)\mid\mathfrak S) - \mathbb E(f(Z))\big\|_\infty$$

where $\Lambda_1$ is the set of real 1-Lipschitz functions and $\mathbb E$ is the expectation with respect to the
distribution of $X$. In our context it is convenient to define the $\sigma$-algebra $\mathfrak S_p = \sigma(X_t, t\le p)$ for
any $p\in\mathbb Z$ and

$$\theta_{\infty,k}(r) = \sup\big\{\theta_\infty(\mathfrak S_p, (X_{j_1},\dots,X_{j_\ell})),\ p+r\le j_1<\dots<j_\ell,\ 1\le\ell\le k\big\}.$$

Assumption (WDP) refers to the cases where $\theta_{\infty,n}(1)$ is well defined for the process $X = (X_t)_{t\in\mathbb Z}$.
Let us give examples of time series satisfying (WDP).

Bounded chains with infinite memory are $\theta_\infty$ weakly dependent. Suppose that $X$ is the solution
of equation (2.4) associated with an innovation which is bounded, i.e. $\|\xi_0\|_\infty < \infty$; then we have
the following result.

Lemma 2.2. Under condition (2.6) there exists a unique causal stationary process $X$ solution of
the equation (2.4). This solution is bounded by $u\|\xi_0\|_\infty/(1-a(F))$ and



n 



a(F) ^o<p<r 
\ 1 r= i 



The proof of Lemma 2.2 is given in the section dedicated to proofs, more precisely in Subsection 5.5.
Uniform $\varphi$-mixing sequences are also $\theta_\infty$ weakly dependent. Let us recall the definition of the
$\varphi$-mixing coefficients introduced in [16]:

$$\varphi(r) = \sup_{(A,B)\in\mathfrak S_0\times\mathfrak F_r}\big|\pi_0(B/A) - \pi_0(B)\big|$$

where $\mathfrak F_r = \sigma(X_t, t\ge r)$. The class of $\varphi$-mixing processes is large; it includes in particular uniformly
ergodic Markov chains, see [13].
Proposition 2.3. If $(X_t)_{t\in\mathbb Z}$ is a stationary process bounded by $C > 0$, then

$$\theta_{\infty,n}(1) \le 2C\sum_{r=1}^{n}\varphi(r).$$

The proof of Proposition 2.3 is given in Subsection 5.5.

3. Main results 

We first give a result that holds with a probability that may be as close to one as desired, then
we give a result in expectation. Then some remarks on these two oracle inequalities are given. In
all the sequel, we work under the assumption that, for every $(p,\ell)\in\mathcal M$, there exists a constant
$1 \le d_{p,\ell} < \infty$ such that

(3.1) $\sup_{\gamma > e}\left\{\dfrac{-\ln\pi_{p,\ell}\left[\exp\left(-\gamma\,(R - R(\overline\theta_{p,\ell}))\right)\right]}{\ln\gamma}\right\} = d_{p,\ell}.$

Even if this definition of the "dimension" of each set $\Theta_{p,\ell}$ is non-standard and comes artificially
in this form from the PAC-Bayesian approach, it is linked with standard notions of dimension
such as the Vapnik dimension or the entropy. More precisely we have the following result.

Proposition 3.1. Let $\dim\in\mathbb N^*$, $x > 0$, and let $B_x$ be the closed $\ell^1$-ball in $\mathbb R^{\dim}$ of radius $x$,
centered at 0. If we assume that $\Theta_{p,\ell} = B_{c_{p,\ell}}$ for $c_{p,\ell} > 0$, that $\pi_{p,\ell}$ is the Lebesgue measure on
$\Theta_{p,\ell}$ and that $\theta\mapsto R(\theta)$ is a $C$-Lipschitz function, then we have:

(3.2) d Pi e < dim x ( 1 + In ( c p ^ I V ' ^ 



v dim c P,e ~ \\ 9 pA J J J 
The proof of this result is given at the end of Subsection 5.4.

3.1. Oracle inequality with large probability. The following inequality is a PAC result. It
is very convenient for building confidence intervals for prediction, see [28] for example for more details
on such confidence intervals.

Theorem 3.2. Let us assume that relation (3.1) is satisfied. Then for all $n$ such that $n\ln^2 n >
(8eK_n)^2$, under (WDP), with $\pi_0$-probability at least $1-\varepsilon$ we have:



R{9) and R{6) < ( 1 V j^^] inf \ R (9 p/ ) + K n 



1 — p/n V n 



2 + (^/^) 2 ./^ £ln( ^ n) 



4 In 



3 In n 



: '"V.< 



y/dp t enln(d Pt en) 



where

(3.3) $k_n = \dfrac{\|X\|_\infty + 2\theta_{\infty,n}(1)}{1+L}.$

The proof of this result is given in the section dedicated to proofs, more precisely in Subsection 
IO page [171 
3.2. Oracle inequality in expectation. The following oracle inequality holds in expectation;
it is a weaker result than the previous one of Theorem 3.2, but the setting is more general as it
holds under both (WDP) and (CIM).

Theorem 3.3. Let us assume that relation (3.1) is satisfied. Then for all $n$ such that $n\ln^2 n >
(8eK_n)^2$, we have:



n [R(6)} andir [R(8)] < ( IV 

+ K n 
where as previously 



2kl 



inf { R i 



2 + {k n /K n ) 2 d p 



1 — p/n 



4 In 



12 Inn 
Wp.t 



\n(d p (ji) H -= 

n ■ y/dp t trihi(dp t en) 



+ 



3(1 + L)tf(c*)ln(n) 



■n 



kn 



|-^||oo + 2^oo,n(l) 

TTl 



under (WDP), 



and where under (CIM), 



(3.4) C,„ = 1 + 2^ n ^ { a{FY' k + aj (F) 

j=k 



u8* 



r=l 



2(1 -a(F)) 



and k n 



Inn 
1 + L' 



The proof of this result is given in Subsection 5.3.

3.3. Comments on the main results.

3.3.1. Comparison with other results. Oracle inequalities in expectation have already been proved
in Modha and Masry [24] and Baraud et al. [5]. Their approaches are based on traditional mixing
coefficients and on classical penalized minimizers of the empirical risk. As already said
in the introduction, except for the fact that they work with the quadratic loss, their results are very
comparable with ours. Our rates are always smaller than those of Modha and Masry [24], as in
their case the rate depends on the decay rate of the mixing coefficients. The results in Baraud et al.
[5] are very competitive with ours. They achieve the optimal rate of convergence, i.e. the optimal
one in the iid case, but they pay for it with a multiplicative constant larger than 1 in the oracle
inequality. More importantly, their approach depends on the (unobservable) mixing properties of
the observations through intricate conditions on the model dimension and on the penalization.
This drawback of their approach is due to the use of the $\beta$-mixing coefficients of the time series.
The weak dependence coefficients used here lead to a sharper Hoeffding type inequality than the
$\beta$-mixing coefficients, see Rio [25] and its consequence, Theorem 5.6 of this paper. It is then
possible, for the first time, to consider here predictors free of the dependence properties of the
observed time series. Remark that, as the choice of the dependence framework is orthogonal to
that of the estimation procedures, it would be interesting to study classical penalized empirical
risk minimizers in the weak dependence framework used here.



3.3.2. Choice of the weights. When k n > K n the order of convergence of -kq[R{9)} to R(6 P ^) 
given by the expression 



is 



10 
and so, by



n 



< \ In (d p gri) 

Wnd Pi i In (dp^n) V n 

Then if the weights are chosen such that satisfying the condition 



as soon as 

i n iiL?i rr 



e 

(3.5) w Pt e > - 



-dp i ln 2 (nd p ^) 



ln(n) 

As the sums of these weights have to be less than 1, such choice is possible if 



If this condition is not satisfied, there may be a loss in the bound on the risk of the estimator due
to the choice of the weights. This loss is classical in learning theory and has nothing to do with
the PAC-Bayesian approach used here. In some way it means that we cannot perform an efficient
model selection if we have too many models.

3.3.3. Choice of the parameter $K_n$. The best choice for the parameter $K_n$ is $k_n$, which depends
on the known parameter $L$ and on the non-observable dependence structure of the observations.
So, in practice, this choice is delicate. In the sequel of this discussion we work
under reasonable weak dependence and complexity conditions, i.e. that #oo,oo(l) < oo , 9^ ^ < oo 
if (CIM) and (3.1). For bounded $\varphi$-mixing processes this is ensured by the summability of the
$\varphi$-coefficients. If we do not have any reason to assume that the process is bounded, we can fix
$K_n = \ln n/(1+L)$ and $w_{p,\ell}$ as in Subsection 3.3.2 in order to obtain the oracle inequality:



7r [R{e)] and tt o [R0)] < inf \ R (6 p/ ) + cJ ^ ln(d p>e n) ln(n) 



for some constant $C > 0$, as soon as $n$ is sufficiently large. If we assume that the observations
are bounded, we can get a refinement by choosing $K_n$ as an upper bound for $k_n = (\|X\|_\infty +
2\theta_{\infty,\infty}(1))/(1+L)$. If we are lucky and the relation $K_n > k_n$ is satisfied, we obtain under (WDP)
and with probability $1-\varepsilon$



R0) and R(6) < inf (6 p/ ) + C^j ^ In ^ j . 

Remark that it is possible that $L$ goes to infinity with $n$, so that for any fixed $K_n = K$ we have
$K > k_n$ for large $n$. But then another loss in a power of $\ln(n)$ appears as the "dimension" $d_{p,\ell}$
grows with $L$, see the application to Neural Networks predictors in Subsection 4.2. If we make a
mistake in the upper estimate of $k_n$, namely $K_n < k_n$, then a multiplicative constant $c\in\,]1,2[$
deteriorates the oracle inequality under both (CIM) and (WDP):



n o [R0)} and 7r < cinf | J2 (0 Pjt ) +C]J^ \n{d p/ n)(k n ) 2 j . 

Such a choice of small $K_n$ no longer ensures the consistency of the estimator. So we recommend
choosing in any case the parameter $K_n = \ln n/(1+L)$, which is free of dependence properties. As a
consequence, we over-penalize and the procedure is very conservative, see the discussion
based on simulations at the end of Subsection 4.1.

4. Applications 

In this section we investigate several possible predictors. Note that in all the applications,
we work on unions of compact subsets of parameters $\Theta_{p,\ell}$ of $\mathbb R^{\dim}$ for some dimension $\dim\in\mathbb N$,
associated with the prior measure $\pi_{p,\ell}$ that is the Lebesgue one. The "dimension" $d_{p,\ell}$ is then
closely related to $\dim$ thanks to Proposition 3.1.

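4.1. Linear predictors. Let us first consider the case of linear auto-regressive predictions. More
precisely, in the case $\mathcal X = \mathbb R$ we consider predictive models of the form:

$$f_\theta(X_{N-1},\dots,X_{N-p}) = \theta_0 + \sum_{j=1}^{p}\theta_j X_{N-j},$$

where $\theta\in\Theta_p\subset\mathbb R^{p+1}$ with by definition, for some $c_p > 0$,

$$\Theta_p = \Big\{\theta\in\mathbb R^{p+1} : \sum_{i=0}^{p}|\theta_i| \le c_p\Big\}.$$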



In this simple case $m_p = 1$ for all $p$, so that the index $\ell$ can be omitted in the sequel. Using
Proposition 3.1 it follows that



^<(p + l)(l + ln(<>(^V — »))) 



where $\overline\theta_p = \arg\min_{\Theta_p} R(\theta)$. Let us fix the weights equal to $w_p = 2^{-p-1}$ for all $p\in\{1,\dots,\lfloor n/2\rfloor\}$.
Then the relations $\sum_{(p,\ell)\in\mathcal M} w_{p,\ell} \le 1$ and (3.5) are satisfied for large $n$ and we have the following
Corollary of Theorem 3.2.



Corollary 4.1. Let us assume that there exists £ > such that for any p, \\0 p \\i < c p — £. For 
n large enough, let us assume that that (CIM) or (WDP) is satisfied, that K n > then there 
exists a C = C(c p , £, K n , ||X||oo) under (WDP) or C = C(cp, £, K n , 9*^ n ) under (CIM), such 
that 

ir [R(9)} andir [R{9)] < ^inf J^R (9 p ) + CK n ^ln(n)|. 

It is a simple consequence of Theorem 3.2 in this context so the proof is omitted.
Linear predictors are expected to be efficient when the observations are solutions of a linear
autoregressive model. Let us assume that $(X_t)_{t\in\mathbb Z}$ is a stationary solution of an AR($\infty$) model

(4.1) $X_t = a_0 + \sum_{i=1}^{\infty} a_i X_{t-i} + \xi_t$ for all $t\in\mathbb Z$

where the $\xi_t$ are iid. Here we do not distinguish degenerate cases, i.e. $(a_i)_{i>0}$ may or may not be
a sequence of infinitely many nonzero numbers. So AR($p$) for $p < \infty$ and AR($\infty$) models are
considered at once. Assume that

(AR) $\sum_{i>0}|a_i| < 1,$

and that the $\xi_t$ are normally distributed as $\mu$. Then it is easy to check that we are in the case
(CIM). Moreover, the distributions of $X_{t+1}$ conditional on $(X_t, X_{t-1},\dots,X_{t-p})$ are Gaussian, hence
symmetric, and the median is also the mean, so that

$$\overline\theta_p = (a_0, a_1, a_2, \dots, a_p).$$

In this classical case, we have a corollary of Theorem 3.2.

Corollary 4.2. Let us fix $c_p = 1$ for all $p$, $w_p = 2^{-p-1}$ and $K_n = \ln(n)/2$. Then there exists a
constant $C = C(\mu, (a_i)_{i\in\mathbb N})$ such that the linear predictors $\hat\theta$ and $\tilde\theta$ satisfy, for sufficiently large $n$:

n [R(9)] andir {R(9)] 



|tto 


1 y^aiXj\ 




i>p 




</i[|&|]+ inf 

l<p<n/2 

The proof is omitted as it is a simple consequence of Corollary 4.1.

Despite its apparent complexity, the procedure used here can be effectively implemented, using 
Monte Carlo methods, see for example Catoni [5] for an effective implementation of PAC-Bayesian 
methods. Actually, in this case the performance of the predictions is clearly not optimal on simula- 
tions when compared with the estimators in Ing and Wei [18] on the same set of experiments. Our 
procedure is clearly too conservative due to the minimax-type approach used here that focusses 
on pessimistic bounds based on the worst cases. The improvement of the practical performances 
of predictions will be the subject of future works. However, if the reader is interested, the code 
for the computation of the estimator is available upon request to the authors. 
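To give an idea of what a Monte Carlo implementation of the first step could look like in the linear case, the following sketch runs a random-walk Metropolis sampler whose target is the Gibbs measure $\pi_p\{-\lambda r_n\}$, with $\pi_p$ uniform on the $\ell^1$-ball of radius $c_p$. This is not the authors' code: the sampler, the function names, the step size and the starting point are all assumptions made for illustration, and no claim is made about its mixing speed.

```python
import math
import random
from typing import List, Sequence


def empirical_risk(theta: Sequence[float], xs: Sequence[float]) -> float:
    """Empirical L1 risk of the linear predictor theta_0 + sum_j theta_j X_{t-j}."""
    p = len(theta) - 1
    n = len(xs)
    total = 0.0
    for t in range(p, n):
        pred = theta[0] + sum(theta[j] * xs[t - j] for j in range(1, p + 1))
        total += abs(pred - xs[t])
    return total / (n - p)


def metropolis_gibbs(xs: Sequence[float], p: int, lam: float, radius: float,
                     n_steps: int, step: float = 0.05,
                     seed: int = 0) -> List[List[float]]:
    """Random-walk Metropolis chain targeting the density proportional to
    exp(-lam * r_n(theta)) on the l1-ball {theta : sum |theta_i| <= radius}."""
    rng = random.Random(seed)
    theta = [0.0] * (p + 1)  # the centre of the ball is always admissible
    risk = empirical_risk(theta, xs)
    samples = []
    for _ in range(n_steps):
        proposal = [c + rng.gauss(0.0, step) for c in theta]
        # Proposals outside the ball have zero target density: reject them.
        if sum(abs(c) for c in proposal) <= radius:
            new_risk = empirical_risk(proposal, xs)
            log_ratio = -lam * (new_risk - risk)
            if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
                theta, risk = proposal, new_risk
        samples.append(list(theta))
    return samples
```

Because the Gaussian proposal is symmetric and the prior is uniform on the ball, the acceptance ratio only involves the difference of empirical risks. In practice one would discard an initial part of the chain and keep a late state as the draw of $\hat\theta_p$.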

4.2. Neural networks predictors. In this section we consider the bounded case (WDP) where
$\|X\|_\infty \le 1$. The neural networks predictors proposed here are close to those in Modha and Masry
[24]. The procedure approximates a natural good predictor given by $m_p(X_{n-p},\dots,X_n)$ where

$$m_p(x) = \mathrm{med}\big(X_0 \mid (X_{-p},\dots,X_{-1}) = x\big) \quad\text{for all } x\in\mathbb R^p,$$

the median of the distribution of $X_0$ conditional on the $p$ past values $(X_{-p},\dots,X_{-1})$. This nonlinear
predictor is the optimal one with respect to the $L^1$-risk.

We will now present the predictors, which are parametric families of functions based on the
abstract neural networks used in Barron [6]. Let us assume that $\phi : \mathbb R\to\mathbb R$ is a Lipschitz
sigmoidal function whose tails approach the tails of the unit step at least polynomially fast.
More precisely, let us make the assumptions:
(NN): Assume that

(1) $\phi(u)\to 1$ as $u\to\infty$ and $\phi(u)\to 0$ as $u\to-\infty$,

(2) $|\phi(u)-\phi(v)| \le D'_1|u-v|$ for all $u,v\in\mathbb R$ and for some $D'_1 > 0$. Set $D_1 = 1\vee D'_1$.

(3) $|\phi(u) - \mathbb 1_{u>0}| \le D'_2/|u|^{D_3}$ for $u\in\mathbb R$, $u\ne 0$ and for some $D_3 > 0$ and $D'_2 > 0$. Set
$D_2 = 1\vee D'_2$.

(SN): Assume that there exists a complex-valued function $\tilde m_p$ on $\mathbb R^p$ such that for $x\in\mathbb R^p$,
we have

$$m_p(x) - m_p(0) = \int\left(e^{i w\cdot x} - 1\right)\tilde m_p(w)\,dw$$

and that

$$\int\|w\|_1\,|\tilde m_p(w)|\,dw \le C'_p < \infty$$

for some $C'_p > 0$. Set $C_p = 1\vee C'_p$.
Then the predictors are expressed as parametric families of functions. More precisely, they are neural
networks with dimension, or "hidden units", $\ell$ and memory, or "time delay" or "lags", $p$. Let
$a_i\in\mathbb R^p$, $b_i\in\mathbb R$ and $c_i\in\mathbb R$ for $1\le i\le\ell$. Setting $\theta = (a_i, b_i, c_i, c_0)$ for some $c_0\in\mathbb R$, we remark
that the dimension of one predictive model is $\ell(p+2)+1$. The predictors are defined as

$$f_\theta(x) = \mathrm{clip}\Big(c_0 + \sum_{i=1}^{\ell} c_i\,\phi(a_i\cdot x + b_i)\Big) \quad\text{for all } x\in\mathbb R^p,$$

where $\mathrm{clip}(y) = (y\wedge 1)\vee(-1)$. Now we restrict the parameters to be in the following ball. Define

Tl = 2 (2D 3 +l)/D 3D l/D 3e ( D . s+1 )/(2D 3 ) 

where $D_1$, $D_2$ and $D_3$ are as in Assumption (NN). Define also a compact subset

v—-\ 1 11 

B P ,e = l Ci ' - C p Jr q' max „> II *!! 1 - ^ + o7' max J^I - ^ + ^ 

i=l 

where || • ||i denotes the ^ 1 -norm. Remark that here the constants are added in each direction to 
have a secure zone of width 1 in the £ 1 -norm around the classical parameter set: 

1 

Bl t> = >^ \<k\ < C»; max lloill < max \b{\ < tp\ 
P ,c i> i<i<e" i<i<£ 

With the help of this secure zone, we have an Oracle inequality where the infimum is taken
over the classical sets $B_{p,\ell}$, for which the optimal $\overline\theta_{p,\ell}$ has good approximation properties,
see [6]. Moreover, as $c_{p,\ell} - \|\overline\theta_{p,\ell}\|_1$ is bounded by 1, it implies that for large values of $p$, $\ell$ it holds
$d_{p,\ell} \le (\ell(p+2)+1)(1+\ln(C_p\vee\ell\tau_\ell + 1))$ applying Proposition 3.1. Finally, let us fix the largest
possible value for $\ell$ as $m_p = \lfloor\sqrt{n/p}\rfloor$. It is enough for having a good approximation thanks to
Theorem 3 of [6]. Then we have the following result for neural networks predictors constructed on
Corollary 4.3. Let us assume (WDP) with ||^l||oo < 1, #00,00 (1) < 00, (NN) and (SN) with 
C p < C'p c for some C , c > and all p > 1. Then if we take w p = 1/n and K n is fixed to some 
K, for all e > there exists a constant C = C(C ,c, D\, D2, D%, K,e) such that for n sufficiently 
large, with probability at least 1 — e, 

< y/ 4 ln 3 / 4 nl 



R{6)andR{9)< inf I tt [\X - med(X \X- lt . . . , X-p)\] + C 

l<p<V"/ln(n) U 



1/4 



The proof of this corollary is given in Subsection 5.6. Following the approach of Modha and
Masry [24], our estimator is said to be a memory universal predictor with rate $\ln^{3/4}(n)/n^{1/4}$. The
rate here is better than the one obtained for the L 2 -risk in [24], (ln(n)/n)2 where < c < 1 
depends on the mixing properties of the process. Remark that the choice of $w_p$ is not optimal
as it does not satisfy the relation (3.5). This implies a loss, due to the fact that we do not manage to
estimate $\sum \exp(-d_{p,\ell})$ here. However, this loss due to the weights is less than the one due to the
fact that here $L$, the Lipschitz constant of the predictors, goes to $\infty$ with $n$. It implies the loss
of a square root of the logarithm through the "dimension" $d_{p,\ell}$, see the proof for more details.
Finally, remark that the result is not easily implementable as artificial neural network predictors depend
on the constants $C_p$, which are not observable.

4.3. Non-parametric auto-regressive predictors. In this section, we propose the following
setting coming from economic modelling and studied in [5]. Let us assume that the process
$(X_t)_{t \in \mathbb{Z}}$ is a solution of the equation:

$$ X_t = f_1(X_{t-1}) + \cdots + f_{p_0}(X_{t-p_0}) + \xi_t, \quad \text{for all } t \in \mathbb{Z} $$

where $\xi_t \sim \mathcal{N}(0, \sigma^2) =: \mu$, the $f_i$ are functions $\mathbb{R} \to \mathbb{R}$ supported by a compact set and $p_0$ is some
unknown finite integer. Remark that, up to a change of scale of $X$, the functions $f_i$ are supported by
$[-1, 1]$. In order to be in a particular case of (CIM), we assume that for any $i \in \{1, \dots, p_0\}$,

(4.2) $\exists a_i \in [0; 1[$, $\forall (x, x') \in [-1; 1]^2$, $|f_i(x) - f_i(x')| \le a_i |x - x'|$

with $a_1 + \dots + a_{p_0} < 1$.

Actually, we assume more regularity on every $f_i$: they belong to the Hölder class $H(s_i, L_i)$ for
$s_i \ge 1$. This means that $f_i$ is $\lfloor s_i \rfloor$ times differentiable and that

(4.3) $\exists L_i > 0$, $\forall (x, x') \in [-1; 1]^2$, $|f_i^{(\lfloor s_i \rfloor)}(x) - f_i^{(\lfloor s_i \rfloor)}(x')| \le L_i |x - x'|$

Remark that if (4.3) is satisfied with the relation

Po , 



t (L»iJ)/ 



-L«iJ 



(4.4) 



+ ••• + 



/P lJ) (o) 



+ U < 1 



then (4.2) follows. It is well known, see for example Tsybakov [27], that if $(\varphi_j(\cdot))_{j \ge 1}$ is the Fourier
basis on $[-1, 1]$, namely $\varphi_{2k}(x) = \sqrt{2}\cos(2\pi k x)$ and $\varphi_{2k+1}(x) = \sqrt{2}\sin(2\pi k x)$, Assumption (4.3)
implies that $f_i$ belongs to a Sobolev class with regularity $s_i$ and so that there is a constant
$\gamma_i = \gamma(L_i, s_i)$ such that for any $m \in \mathbb{N} \setminus \{0\}$,



mm 

(ai,...,a m )eE" 



3=1 



ds 



< jim Si . 



Natural predictors then take the form



A n+ i — 6ij(pj(X n -i) — : fg(X n , . . . , X n _ p ) 

i=l i=i 



for any p G {1, |_ n /2j}, any £ G {1, ...,m p = n} and any G M satisfying the relation 



t=i i=i 



This ensures that any /g is an L-Lipschitz function. Finally, let us define for any t G {l,...,n}, 
i G {1, . . . , the coefficients 9 p ^ G MP that satisfy the relation 



arg mm ttq 



v I 

X n — £ £ 9ijipj(X n 
i=i j=i 



and we obtain as a consequence of Theorem

Corollary 4.4. Let $w_{p,\ell} = n^{-2}$, $K_n = \ln(n)/(1 + L)$ and $s = \inf\{s_1, \dots, s_{p_0}\}$. Let us assume that
Equations (4.3) and (4.4) are satisfied and that there exists $c > 0$ such that for any $\ell \in \{1, \dots, n\}$
we have




L- I > > )f;(2f/72l)-' 1 > r. 

Then there is a constant

$$ C = C(p_0, \sigma, L_1, s_1, \dots, L_{p_0}, s_{p_0}, c) $$

such that

$$ \pi_0[R(\hat\theta)] \text{ and } \pi_0[R(\tilde\theta)] \le \mu(|\xi_0|) + C\, n^{-\frac{s}{2s+1}} \ln^2(n). $$

The minimax rate of convergence with respect to $s_1, \dots, s_{p_0}$ is achieved up to a loss in $\ln^2(n)$.
Remark that the choice of the weights $w_{p,\ell}$ is not optimal as it does not satisfy condition (3.5).
But it has no effect on the rate in the Oracle inequality as the loss coming from the weights is
smaller than the one coming from the over-penalization. Remark also that the result in [5] achieves
the minimax rate of convergence with no extra logarithmic factor. This minimax rate is achieved
for the excess $L^2$-risk, not of prediction, but empirically on the distribution of the observed values.
We argue that our risk is more natural in the time series forecasting context. However, note that
in [5] it is assumed that $p_0 \le p_{\max}$ for some known $p_{\max}$ satisfying some relation with the $\beta$-mixing
coefficients of the observed process. This is restrictive as that model selection procedure depends on
$p_{\max}$ and on $\beta$-mixing coefficients that are not observable.
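
The additive predictors on the Fourier basis used in this subsection can be sketched as follows. This is an illustration only: the constant function $\varphi_1 \equiv 1$ is an assumed convention for the first basis element, the indexing of the lags is an assumption, and the names `fourier_basis` and `additive_predict` are hypothetical.

```python
import math

def fourier_basis(j, x):
    """j-th trigonometric basis function, j >= 1.

    phi_{2k}(x) = sqrt(2) cos(2 pi k x) and phi_{2k+1}(x) = sqrt(2) sin(2 pi k x)
    for k >= 1; phi_1 = 1 is an assumed convention for the constant term.
    """
    if j == 1:
        return 1.0
    k = j // 2
    if j % 2 == 0:
        return math.sqrt(2.0) * math.cos(2.0 * math.pi * k * x)
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * k * x)

def additive_predict(lags, theta):
    """Additive predictor sum_i sum_j theta[i][j] * phi_{j+1}(lags[i]).

    lags[i] is the (i+1)-th lagged observation and theta is a p x ell
    coefficient table (list of rows).
    """
    return sum(
        theta_ij * fourier_basis(j + 1, x_i)
        for x_i, row in zip(lags, theta)
        for j, theta_ij in enumerate(row)
    )
```

Each lag contributes its own univariate expansion, which mirrors the additive structure $f_1(X_{t-1}) + \cdots + f_{p_0}(X_{t-p_0})$ of the model.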

5. Proofs 

To present the proofs in a unified way, whether we work under (CIM) or (WDP), we truncate
the observations if we are under (CIM). This method is justified by the result of
Lemma 2.3. More precisely, we truncate the innovations $\xi_t$ and replace them with $\bar\xi_t = (\xi_t \wedge C) \vee (-C)$.
Now we denote by $\bar X = (\bar X_t)_{t \in \mathbb{Z}}$ the solution of the equation

$$ \bar X_t = F(\bar X_{t-1}, \bar X_{t-2}, \bar X_{t-3}, \dots; \bar\xi_t), \quad \text{a.e. for all } t \in \mathbb{Z}. $$

This solution exists and satisfies weak dependence conditions, see Lemma 2.3 for more details. To
treat both cases in the same way, we denote in the sequel $\bar X := X$ and $\|X_0\|_\infty = C$ under (WDP).
Moreover, we will use the notation $\bar r_n$, $\bar R$ for the risks associated with $\bar X$.
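
The truncation step can be illustrated numerically. The sketch below assumes a toy AR(1) chain $X_t = a X_{t-1} + \xi_t$ with $|a| < 1$, which is only one simple example chosen for illustration; `truncate` and `ar1_path` are hypothetical helper names.

```python
def truncate(xi, C):
    """Truncated innovation (xi min C) max (-C)."""
    return max(-C, min(xi, C))

def ar1_path(innovations, a, x0=0.0):
    """Iterate X_t = a * X_{t-1} + xi_t, a toy chain assumed for illustration."""
    path, x = [], x0
    for xi in innovations:
        x = a * x + xi
        path.append(x)
    return path

raw = [3.0, -5.0, 0.2, 10.0]
C = 1.0
truncated = [truncate(xi, C) for xi in raw]
path = ar1_path(truncated, a=0.5)
```

With innovations bounded by $C$, the toy path started at 0 stays within $C/(1 - a)$, which is the kind of boundedness the truncation is meant to provide.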

We will now present some useful Lemmas. Their proofs are postponed to the end of the section.

5.1. Useful Lemmas. The first Lemma gives a bound on the deviations of the risk of $\bar X$. The
result derives simply from Rio's "Hoeffding-type" inequality stated in [25].



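Lemma 5.1. For any $\lambda > 0$ and $\theta \in \Theta$ we have:

$$ \pi_0\left[\exp\left(\lambda(\bar R(\theta) - \bar r_n(\theta))\right)\right] \le \exp\left(\frac{\lambda^2 k_n^2}{n(1 - p(\theta)/n)^2}\right), $$

where $k_n$ depends on the nature of the observations, more precisely is given by the relations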

uC(l + 2 £? = i mfo< fc <, Un /h + E°° =fe a*(*")}) 
(CIM) kn = — (l + L)(l-a(F)) " 

(wdp) ^ n = HkL^W.

The proof of this Lemma is given in Section 5.4.

We now give a result particularly useful for the so-called "PAC-Bayesian" randomization technique
proposed by Catoni [7], [8]. Given a measurable space $(E, \mathcal{E})$ we let $\mathcal{M}_+^1(E)$ denote the set
of all probability measures on $(E, \mathcal{E})$. The Kullback divergence is a pseudo-distance on $\mathcal{M}_+^1(E)$
defined, for any $(\pi, \pi') \in [\mathcal{M}_+^1(E)]^2$, by the equation

$$ \mathcal{K}(\pi, \pi') = \begin{cases} \pi[\ln(d\pi/d\pi')] & \text{if } \pi \ll \pi', \\ +\infty & \text{otherwise.} \end{cases} $$
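
On a finite space the Kullback divergence reduces to a sum, which can be sketched as follows. The function name `kullback` and the representation of measures as lists of probabilities are assumptions made for illustration.

```python
import math

def kullback(rho, pi):
    """Kullback divergence K(rho, pi) between two discrete distributions.

    Returns +infinity when rho is not absolutely continuous with respect
    to pi, following the convention of the definition above.
    """
    total = 0.0
    for r, q in zip(rho, pi):
        if r == 0.0:
            continue  # 0 * ln(0 / q) is taken to be 0
        if q == 0.0:
            return math.inf  # rho charges a point that pi does not
        total += r * math.log(r / q)
    return total
```

The divergence is not symmetric, which is why it is only a pseudo-distance.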



Lemma 5.2 (Legendre transform of the Kullback divergence function). For any $\pi \in \mathcal{M}_+^1(E)$, for
any measurable function $h : E \to \mathbb{R}$ such that $\pi[\exp(h)] < +\infty$ we have:

(5.1) $$ \pi[\exp(h)] = \exp\Big( \sup_{\rho \in \mathcal{M}_+^1(E)} \big( \rho[h] - \mathcal{K}(\rho, \pi) \big) \Big), $$

with the convention $\infty - \infty = -\infty$. Moreover, as soon as $h$ is upper-bounded on the support of $\pi$, the
supremum with respect to $\rho$ in the right-hand side is reached for the Gibbs measure $\pi\{h\}$ defined
in (EH).

The proof of this Lemma is omitted here as it can be found in [7] or [8].
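
The identity (5.1) can be checked numerically on a finite space. The sketch below is an illustration under assumptions: measures are lists of probabilities, the Gibbs measure is taken to be proportional to $\exp(h)\,\pi$, and all function names are hypothetical.

```python
import math

def kl(rho, pi):
    """Discrete Kullback divergence K(rho, pi)."""
    total = 0.0
    for r, q in zip(rho, pi):
        if r > 0.0:
            if q == 0.0:
                return math.inf
            total += r * math.log(r / q)
    return total

def gibbs(pi, h):
    """Gibbs measure with density proportional to exp(h) with respect to pi."""
    m = max(h)  # subtract the maximum for numerical stability
    weights = [q * math.exp(v - m) for q, v in zip(pi, h)]
    z = sum(weights)
    return [w / z for w in weights]

def log_laplace(pi, h):
    """ln pi[exp(h)], the logarithm of the left-hand side of (5.1)."""
    return math.log(sum(q * math.exp(v) for q, v in zip(pi, h)))

def objective(rho, pi, h):
    """rho[h] - K(rho, pi), the quantity maximised in (5.1)."""
    return sum(r * v for r, v in zip(rho, h)) - kl(rho, pi)

pi = [0.2, 0.3, 0.5]
h = [1.0, -0.5, 0.3]
best = objective(gibbs(pi, h), pi, h)
```

The Gibbs measure attains the supremum, while any other probability vector, for instance `pi` itself or a point mass, gives a smaller value of `objective`.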

With the help of Lemma 5.2 we can then prove a general PAC-Bayesian bound from Lemma
5.1. It consists in an upper bound for the mean risk of Gibbs estimators in all sub-models.

Lemma 5.3. Under the assumptions of Theorem 3.2 we have for any $\lambda > 0$ and $(p, \ell) \in M$:

(5.2) $$ \pi_0\left[\exp\left(\sup_{\rho \in \mathcal{M}_+^1(\Theta_{p,\ell})} \big\{\lambda\rho[\bar R - \bar r_n] - \mathcal{K}(\rho, \pi_{p,\ell})\big\} - \frac{\lambda^2 k_n^2}{n(1 - p/n)^2}\right)\right] \le 1, $$

where $k_n$ has the same expression as in Lemma 5.1.

The proof of this Lemma is given in Section 5.4.

From this result, we derive another PAC-Bayesian bound on the mean risk of any aggregation 
estimators of all Gibbs estimators. The techniques were developed by Catoni [U [7] in the iid 
or exchangeable setting for classification on the basis of the seminal paper of McAllester [22] 
and extended by Audibert [4] to regression with quadratic loss and Alquier [2] to a general loss 
function. The scheme used here follows [7].



Lemma 5.4. For any measurable function p P: i : X r 
measurable family of weights w^g : X n — > [0, 1] with 

(p,£)eM 

under the assumptions of Theorem we have: 



M\(e p>e ) for (p,£) £ M and for any 



w p\i < !> 



7TQ 



sup 

AGS 
P P ,i 6 M\{Q Vt e) 



exp ^X(R 



In 



dp. 



•p.C 



1l2 



dir Pt i n{l-p/n)' 



+ In 



w,,
Vvr



sup 

AGS 
0, t) G M 
p P ,e G M\(& p> e) 



exp ( Xp P) i[R - r n ] - JC(p Pi £, tt p /) 



2 1,2 



\ z k 



n(l — p/n) 2 



+ In 



< 1, 



where $k_n$ has the same expression as in Lemma 5.1.
The proof of this Lemma is given in Section 5.4.

Finally, we present a Lemma that gives a useful inequality under (CIM). We recall that $\Psi$
denotes here the Laplace transform of $\|\xi_0\|$, which is assumed to be finite.

Lemma 5.5. Let us define the following random variable:

$$ g(C) = \sup_{\theta \in \Theta} |r_n(\theta) - \bar r_n(\theta)|. $$

We have under (CIM) the following inequality, for any $c > 0$:

$$ \pi_0[g(C)] \le \frac{1 + L}{1 - a(F)}\, u\, \Psi(c)\, C \exp(-cC). $$

The proof of this Lemma is given in Section 5.4.

5.2. Proof of Theorem 3.2. We are now able to give the proof of our main theorem. Let us
apply Lemma 5.4. As it holds for any probability measure $\rho_{p,\ell}$, it holds for $\rho_{p,\ell} = \pi_{p,\ell}\{-\lambda r_n\}$
associated to any $\lambda \in \mathcal{G}$. We use the inequality $\forall x \in \mathbb{R}$, $\exp(x) \ge 1_{\mathbb{R}_+}(x)$ and the associated
Markov inequality:

Tr p Pj t(A + ln(e) > 0) < 7r p p ,e[exp(A)}e < e 



where A = \(R 



hi 



dp p ,e 



dTTpi n(l— p/n) 2 



+ In 



\G\w x 



for any A £ Q. Here we used the fact that 



p P: £ is a probability conditional on (Xi, . . . , X n ) in order that nop p / is a well defined probability 
measure. Moreover, we have used elementary convex inequality to get rid off with the sum of the 
weights Wp £ as they are fixed. With probability 1 — e on the drawing of the data with respect to 

7To and on the drawing of all the estimators 9\ with respect to pCg and on the drawing of p and 



according to Wp £ , we have, for any (p, €) G M: 



A 



dp p 



dir. 



p.L 



A 



A 



£ 



(5.3) R{^)<f n [o; /) + n{i _ p/n)2 

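Using the same technique but with the second part of the result of Lemma 5.4 we obtain for any
$(p, \ell) \in M$, $\lambda \in \mathcal{G}$ and $\rho \in \mathcal{M}_+^1(\Theta_{p,\ell})$,

(5.4) $$ \int \bar r_n(\theta)\rho(d\theta) \le \int \bar R(\theta)\rho(d\theta) + \frac{\lambda k_n^2}{n(1 - p/n)^2} + \frac{1}{\lambda}\mathcal{K}(\rho, \pi_{p,\ell}) + \frac{1}{\lambda}\ln\frac{|\mathcal{G}|}{w_{p,\ell}} + \frac{1}{\lambda}\ln\frac{1}{\varepsilon}. $$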



Note that (15. 3|) is equivalent to 



R 



< 



1 



In 



exp (-Ar„(0)) Tr P) e(d9) + 



p,i 



n(l — p/n)' 



+ T ln 
A 



(5.5) 



+ r n 



p,i 
X 



I w, 
< R (p, £, A) + - In 

A 8 



n(l — p/rif 



W P J A £
so we obtain

(5.6) R^<R(p,£,X) + jhx-^ + ^ 



1, , Kkl-Kl) 



+ 5(C)1(cim)- 



(1 — p/n) 2 

First let us study the estimator $\hat\theta$. For any $\lambda > 0$, let us choose $w'_{p,\ell} = 1$ when $(p, \ell)$ minimizes
$R(p, \ell, \lambda)$ and $0$ otherwise. Remembering that $(\hat p, \hat\ell, \hat\lambda) = \arg\min R(p, \ell, \lambda)$, we obtain:

A(^-iO i 



(5.7) 



R(B) < inf fl(p,i,A) + . 

P,M n(l - p/n) 2 \ 



-lne + 5 (C)l (C iM)- 



Now, we are going to upper bound the term $R(p, \ell, \lambda)$. From inequality (5.4) we derive that



5- 







exp(-Ar n (0))7r Pi £(rf0) 



inf 



r n (6) p(d3) + -/C (p, tt p/ ) 



< 



inf 



< i 




peM 




u ■ 

Je> P ,t 















inf 

peM\{e p ,i) [Je Ptt 



R(6) P {d0) + -K, (p,ir p>e ) \ + 



r n (6) p(d9) + -/C (p, tt p/ ) } + g{C) 1 



(CIM) 



XH +T ln 7^T + ^) 1 (ciM) 



n(l-p/n) 2 A £w P: i 
A k ri 



p/n) 2 A ewpj 



So we obtain: 



A 



(5.8) R (p, £, A) < -- In / exp --R(0) ir p ,t{dB) + 



X^, 



A(fc2+K2) 1 



n(l — p/n) 2 A 



+ T ln 7^ + ^) 1 (ciM ) - 



Now, let us remark that, as soon as $\lambda \ge 2e$, we have that



In 7T- 



< In - 



as we work under Assumption (3.1) and it easily follows that



In 7r, 



pJ 









exp (-H_ 


< -ln7T p ^ 


exp ( - H. 



ln7Tn 



e Xp (--(i?-i?(0^)) 



+ ^7T0b(C)]l(CIM) 
+ ^(M + ^7T0b(C)]l(CIM) 

< dp,* In - + -zR(d p ,i) + ir7r \g(C)] 1 (C im)- 



We plug this result into the inequality (5.8) to obtain:
(5.9) 



R{p,£,X) < R(e Pti ) +j (d p/ ln 2 - + ln-p- ) + 



A 



EWpiJ n(l-p/n)< 



(CIM)-
Now we can conclude the proof of Theorem 3.2. We work under (WDP) so that $(g(C) + \pi_0[g(C)])\,1_{(CIM)} = 0$
and that $\bar R = R$. It remains to collect the information of the inequalities
(5.7) and (5.9). We obtain:



(5.10) R0)< mf x {R( P ,e,X)} + ^ 



\{kl-Kl) 1 



— Ins 



(1-p/n) \ 

\(kl + Kl) 



< inf {r (9 p>i ) + 1 (d p/ In 2 £ + In -^U + 



| \{kl-Kl) l lng 
n(l — p/n) 2 \ 



So it remains to get rid of the last two terms. First of all, let us control $(1/\hat\lambda) \ln(1/\varepsilon)$. Remember
that $\hat\lambda$ is the minimizer of



R(p,l,\) = ~ln [ exp(-Ar„(0))Ar 0i (0) + ih— + ^ 



2 



u p,i nil-?:' 2 



1 \K 2 
G(A) + ^ln|g|+ - n ^ 2 



n 



1 - ^ 



where G is a decreasing function as 

G'(A) = ^ In / exp (-Ar n (0)) dvr - i / r n (0)dvr |{-Ar n }(0) - In — 

and we can check that each of these three term is negative. So this means that A > A where A is 
the minimizer of 

1, . A^ 2 
A 



— Ill | | i ; — >• 



n l-# 



it appears that A is known in explicit form and so we obtain 

1.11,1 K n \ 1K n \ 

— In — < — in — = -7 r ^ = ^ = in — < — — In — . 

X e X e h_|j v /^-^j e v 7 ^ e 

So, Inequality (5.10) becomes

+ Kl) 



(5.11) «W < Jjf {« + i In' * + ta-M.) + ^ 



|2 



■ p/n 

A(fc 2 - if 2 ) 2K 1 
+ n(l-p/n) 2+ ^ ln ? 



Let us now consider two cases: $k_n \le K_n$ and $k_n > K_n$. If $k_n \le K_n$, Inequality (5.11) becomes

,5,2, nt)<M { RC e r , t ) + \(^^^) + ^±^}
and let us replace the infimum with respect to A by the specific value

1 — p/n 



X*(p,£) 



that balances the second and the third term of the sum in (5.12). Now if $n^2 \ge \lambda^*(p, \ell) \ge 4e$
then we can find a $\lambda' \in \mathcal{G}$ such that $\lambda' \ge 2e$ and $\lambda' \le \lambda^*(p, \ell) \le 2\lambda'$. This holds if, for example,
p < n/2, nln 2 n > (8eK n ) 2 and d p ^ < nK n . This leads to the following inequality 



(5.13) 



k n /K n + 2K n d p/ 



R{9)< inf lR(§ e )+-nL. 

- ' f <nK n I v p ' tJ 1 - p/n 



[g] 

ew p e 



— — In (dp tn) H 

n (1 - p/n)y/d P! enhi(d p> en) 



2K n . 1 
■In -. 

n e 



Now, let us consider the case where $k_n > K_n$ (this is the difficult case). We have

Kkl-Kp _ kl-Kl\{kl + Kl) 
n{\—p/n) 2 k 2 + K 2 n(l — p/n) 2 



k 2 + -^n 



■In 



exp (-Ar„(0)) ^7^(0) + T In 



+ 



A(fe2+K2) 



k 2 - K 2 



by definition of $(\hat p, \hat\ell, \hat\lambda)$ and so, using Inequality (5.10) we obtain



(5 . 14 ) «WS|1 + ^|)^ 



The same particular value of A leads to 



V i + T ( d p/ m2 2 + m 



+ 



X(k 2 + if, 



n(l — p/n) 2 

2K n 1 
H — In - . 

n e 



k 2 , -\- K 2 k 2 , K„ 



k 2 n /K n + 2K n d p 



K n ln 



+ 



p/n 

m 



— ln(d Pi in) 



(1 - p/n) d p ^n \xi{d p ^n) 



2K r , 1 
H — In - . 

n £ 



If we combine (5.13) and (5.15) on both cases, and if we remark that $|\mathcal{G}| \le \log_2(n^2) \le 3\ln(n)$, we
obtain



R{6) < IV 



2 Av 



k 2 + K 2 J d Pi i<nK. 



inf I R (6 p/ ) + K n 



2 + (k n /K n ) 2 dpj 
1 — p/n V n 



hx(d p tn)
4 In



+ 



3 In n 

ew p j 



which ends the proof.
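
The role of the grid of values of $\lambda$ in the proof can be illustrated with a short sketch. It assumes, purely for illustration, that the grid is the geometric grid $\{2^k : k \ge 1,\ 2^k \le n^2\}$; the exact definition belongs to an earlier section of the paper and is not restated here. The names `lambda_grid` and `grid_point_below` are hypothetical.

```python
import math

def lambda_grid(n):
    """Assumed geometric grid {2^k : k >= 1, 2^k <= n^2}."""
    grid, lam = [], 2
    while lam <= n * n:
        grid.append(lam)
        lam *= 2
    return grid

def grid_point_below(lam_star, n):
    """Largest grid point lam' with lam' <= lam_star, so that lam_star < 2 * lam'."""
    candidates = [lam for lam in lambda_grid(n) if lam <= lam_star]
    return max(candidates)

n = 100
grid = lambda_grid(n)
lam_prime = grid_point_below(50.0, n)
```

Under this assumption the grid has at most $\log_2(n^2) \le 3\ln(n)$ elements, and any $\lambda^* \in [4e, n^2]$ is approximated within a factor 2 by a grid point $\lambda' \ge 2e$, so restricting to the grid only costs a constant factor.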



5.3. Proof of Theorem 3.3. Here we deal with both cases (WDP) and (CIM) at the same
time. We also use the results given in the proof of Theorem 3.2. More precisely, it has
been shown that for any $(p, \ell)$ such that $d_{p,\ell} \le nK_n$ the following inequality cannot hold with
probability larger than $\varepsilon$:



R{9) and R{9) > 1 V 



R(6 p>e )+K„ 



2 + (k n /K n y d„ t 



1 — p/n 



n 



ln(dp^n) 



4 In 



3 In n 



+ 



y/ d Pi in\a.{d p ^n) 



+ {g{C) +7r b(C)])l (CIM ) 



and so for 



R{6) and R{9) > 1 V 



2kl 



R [0 P ,i) + K n 



2 + (k n /K n ) 2 dp, 



1 — p/n 



n 



In (dp^n) 



4 In 



31nra 



+ (g(C) + 27rob(C)])l (C iM) = a n (p,£) + b n (p,£) log -. 



y/d Pi £n\n(dpjn) 

Let us now deal with 9 (the proof for 9 is similar). For any (p,£) with d Pt £ < nK, 



7T 



g 21> n (p,£) > £ ~2 



this leads to 



and so 





R(0)-a n (p,£)~ 


j-OO 


H(§)-a„(p,<) 






e 26 n ( P ,£) 


= / vr 


g 26 n (p,£) > i 


dt < 






JO 







1 A -g ) <// 2 



7T0 



R(9)-a n (p,£) <2b n (p,£)ln2. 



Replacing a n (p,£) and b n (p,£) by their definitions we obtain 



7T0 



R(0) 



< IV 



2Al- 



R {6 p j) + K n 



2 + {k n /K n y d p 



1 — p/n 



n 



]n(dpjri) 



4 In 



+ 



12 Inn 



yjd v ,in In(dp^n) 



+ 3vr [c/(C)]l(ciM) 



Under (WDP) we then get the desired result. Under (CIM), we use the result of Lemma 
to choose $C$ in order to balance $k_n(C)$ given in Lemma 5.1 and $g(C)$. We fix it equal to

Inn « f 1 + 2 E"=i kfo< fc < P + £°° =fc a,-(F)}) 

C* = and c* = — i , 1 t—t ±A 

2c* 2(1 - a(F))
Remark that this choice is independent of $p$, $\ell$. This ends the proof for $\hat\theta$ and the same results
also hold for $\tilde\theta$.

5.4. Proofs of Lemmas 5.1, 5.3, 5.4, 5.5 and of Proposition 3.1.

Proof of Lemma 5.1. The proof of this Lemma is based on the application of a useful inequality
from Rio [25] to $\bar X$. Let us first recall this result:

Theorem 5.6. Let $Y = (Y_t)_{t \in \mathbb{Z}}$ be a stationary time series bounded by $C$ distributed as $\pi_0$ on
Let $h$ be a 1-Lipschitz function of $\mathcal{X}^n \to \mathbb{R}$, i.e. such that:

(5.16) $\forall (x_1, y_1, \dots, x_n, y_n) \in \mathcal{X}^{2n}$, $|h(x_1, \dots, x_n) - h(y_1, \dots, y_n)| \le \sum_{i=1}^{n} \|x_i - y_i\|$.

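Then for every $t \in \mathbb{R}$ we have:

$$ \pi_0\left[\exp\left(t\big(\pi_0[h(Y_1, \dots, Y_n)] - h(Y_1, \dots, Y_n)\big)\right)\right] \le \exp\left(\frac{t^2}{2}\, n\, (C + \theta_{\infty,n}(1))^2\right). $$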



Proof of Theorem 5.6. We obtain this version of Theorem 1 of [25] by remarking that we can rewrite
the inequality (3) in [25] as, for any 1-Lipschitz function $g$:

$$ \Gamma(g) = \|E(g(X_{\ell+1}, \dots, X_n) \mid \mathcal{F}_\ell) - E(g(X_{\ell+1}, \dots, X_n))\|_\infty \le \theta_{\infty,n-\ell}(1). $$

It leads to the result of Theorem 5.6 when bounding $\sum_{r=1}^{n}(C + \theta_{\infty,r}(1))^2$ with $n(C + \theta_{\infty,n}(1))^2$ as
$\sup_{1 \le r \le n} \theta_{\infty,r}(1) \le \theta_{\infty,n}(1)$. □

We now apply the result of Theorem 5.6 to $Y = \bar X$ to obtain the result of Lemma 5.1. Let us
fix $\lambda > 0$, $(p, \ell) \in M$, $\theta \in \Theta_{p,\ell}$ and $t = (1 + L)\lambda/(n - p(\theta))$ and the function $h$ defined by:

$$ h(x_1, \dots, x_n) = \frac{1}{1 + L} \sum_{i=p(\theta)+1}^{n} \|x_i - f_\theta(x_{i-1}, \dots, x_{i-p(\theta)})\|. $$

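We easily check that $h$ satisfies condition (5.16) in order to apply Rio's inequality. Note that:

$$ \begin{aligned}
|h(x_1, \dots, x_n) - h(y_1, \dots, y_n)|
&\le \frac{1}{1+L} \sum_{i=p(\theta)+1}^{n} \Big| \|x_i - f_\theta(x_{i-1}, \dots, x_{i-p(\theta)})\| - \|y_i - f_\theta(y_{i-1}, \dots, y_{i-p(\theta)})\| \Big| \\
&\le \frac{1}{1+L} \sum_{i=p(\theta)+1}^{n} \|x_i - y_i - f_\theta(x_{i-1}, \dots, x_{i-p(\theta)}) + f_\theta(y_{i-1}, \dots, y_{i-p(\theta)})\| \\
&\le \frac{1}{1+L} \sum_{i=p(\theta)+1}^{n} \|x_i - y_i\| + \frac{1}{1+L} \sum_{i=p(\theta)+1}^{n} \|f_\theta(x_{i-1}, \dots, x_{i-p(\theta)}) - f_\theta(y_{i-1}, \dots, y_{i-p(\theta)})\| \\
&\le \frac{1}{1+L} \sum_{i=p(\theta)+1}^{n} \|x_i - y_i\| + \frac{L}{1+L} \sum_{i=1}^{n} \|x_i - y_i\| \\
&\le \sum_{i=1}^{n} \|x_i - y_i\|,
\end{aligned} $$

where the fourth inequality uses the fact that $f_\theta$ is $L$-Lipschitz.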
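
The 1-Lipschitz property of $h$ can be checked numerically in a simple case. The sketch below assumes, for illustration only, real-valued observations and a linear predictor $f_\theta(x_{i-1}, \dots, x_{i-p}) = \sum_j \theta_j x_{i-j}$ with Lipschitz constant $L = \sum_j |\theta_j|$; all names are hypothetical.

```python
import random

def h(x, theta):
    """h(x) = (1 / (1 + L)) * sum_{i > p} |x_i - f_theta(x_{i-1}, ..., x_{i-p})|
    for a linear predictor with coefficients theta (assumed example)."""
    p = len(theta)
    L = sum(abs(t) for t in theta)
    total = 0.0
    for i in range(p, len(x)):
        prediction = sum(theta[j] * x[i - 1 - j] for j in range(p))
        total += abs(x[i] - prediction)
    return total / (1.0 + L)

def l1_distance(x, y):
    """sum_i |x_i - y_i|, the norm used in condition (5.16)."""
    return sum(abs(u - v) for u, v in zip(x, y))

rng = random.Random(0)  # fixed seed so the check is reproducible
theta = [0.6, -0.3]
pairs = [
    ([rng.uniform(-1, 1) for _ in range(8)], [rng.uniform(-1, 1) for _ in range(8)])
    for _ in range(100)
]
worst_ratio = max(
    abs(h(x, theta) - h(y, theta)) / l1_distance(x, y) for x, y in pairs
)
```

The ratio `worst_ratio` never exceeds 1, in agreement with the chain of inequalities above.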



The direct application of Theorem 5.6 ends the proof under (WDP). Under (CIM) $k_n$ is computed
in view of the estimate of $\theta_{\infty,n}(1)$ obtained in Lemma 2.2. □

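Proof of Lemma 5.3. Integrate the inequality in Lemma 5.1 with respect to $\pi_{p,\ell}$ on $\Theta_{p,\ell}$ (then $p(\theta) = p$)
for any $(p, \ell) \in M$ in order to obtain:

$$ \pi_{p,\ell}\big[\pi_0[\exp(\lambda(\bar R - \bar r_n))]\big] \le \exp\left(\frac{\lambda^2 k_n^2}{n(1 - p/n)^2}\right). $$

Fubini's Theorem implies that

$$ \pi_0\left[\pi_{p,\ell}\left[\exp\left(\lambda(\bar R - \bar r_n) - \frac{\lambda^2 k_n^2}{n(1 - p/n)^2}\right)\right]\right] \le 1. $$

Applying Lemma 5.2 for $\pi = \pi_{p,\ell}$ and $h = \lambda(\bar R - \bar r_n) - \lambda^2 k_n^2/(n(1 - p/n)^2)$ on $\mathcal{M}_+^1(\Theta_{p,\ell})$ leads to
the inequality:

$$ \pi_0\left[\exp\left(\sup_{\rho \in \mathcal{M}_+^1(\Theta_{p,\ell})} \big\{\lambda\rho[\bar R - \bar r_n] - \mathcal{K}(\rho, \pi_{p,\ell})\big\} - \frac{\lambda^2 k_n^2}{n(1 - p/n)^2}\right)\right] \le 1. $$

This ends the proof.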

Proof of Lemma 5.4. First, let us choose $\lambda \in \mathcal{G}$. Let $h_{p,\ell}$ denote, for any $(p, \ell) \in M$:



□ 



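$$ h_{p,\ell} = \sup_{\rho_{p,\ell} \in \mathcal{M}_+^1(\Theta_{p,\ell})} \big\{\lambda\rho_{p,\ell}[\bar R - \bar r_n] - \mathcal{K}(\rho_{p,\ell}, \pi_{p,\ell})\big\} - \frac{\lambda^2 k_n^2}{n(1 - p/n)^2}. $$

From Lemma 5.3 applied on the different $\mathcal{M}_+^1(\Theta_{p,\ell})$ we have, for any $(p, \ell) \in M$:

$$ \pi_0\left[\exp(h_{p,\ell})\right] \le 1. $$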

Now we apply Inequality (ED) in Lemma[52]for tt = J2(p,i)eM w p,e$( P ,e) and h = J2(j>,e)eM hp,e^& P ,, 



and we obtain 



7To exp 



sup 



w 'p,t h vt- Y w 'p,t ln ( w 'p,e/ w pJ 



< 1 



and, by Jensen's inequality, and replacing h\ by its definition, 



(5.17) vr 



sup , \ Y w 'pJ sup exp ( Xp P< 1 



X(R-r n )-ln 



In 



dpp,i 
dir Pj e 

Wp,l 



n(l-p/n) 2 w' pt J 
By Jensen again, we obtain a bound for the first term in the sum bounded in Lemma 15.41 



< 1.
7T0



sup 

(p,t)£M w ' P ,e =1 



w' v( sup p p/ 



exp ( X(R — r n ) — In 



dp P ,, 



dir 



2 1.2 



n(l — p/n)'' 



+ ln^ 



w 



< 1. 



Finally, we sum this inequality over all $\lambda \in \mathcal{G}$ to bound the first expectation.

The second expectation is bounded by choosing specific weights $w'_{p,\ell}$ in the supremum in inequality
(5.17) such that $w'_{p,\ell} = 1$ for $(p, \ell) = \arg\max_M \{h_{p,\ell}\}$:



7TQ 



sup 

( P , i) e m 

P P ,e G M+(0 P ,«) 



2;„2 



exp \ppi[R - r n ] - JC(p Pi e, tt p ^) 



X k, 



n(l — p/n) 2 



+ In w Pji 



< 1. 



□ 



Again a summation over all $\lambda \in \mathcal{G}$ leads to the result. This ends the proof.

Proof of Lemma \5.5\ . From the proof of the Lemma UTTl we already know that \f n (6) — r n (8)\ < 
(1 + L) Yli=i — Xi\\. This bound holds uniformly on 6. Now we are reduced to estimate 
$\pi_0[\|\bar X_0 - X_0\|]$. For this, we use the assumption (2.5) and the stationarity of $X$ and $\bar X$. More
precisely:

$$ \pi_0[\|\bar X_0 - X_0\|] \le u\,\mu[\|\bar\xi_0 - \xi_0\|] + \sum_{j \ge 1} a_j(F)\,\pi_0[\|\bar X_{-j} - X_{-j}\|] \le u\,\mu[\|\xi_0\| 1_{\|\xi_0\| > C}] + a(F)\,\pi_0[\|\bar X_0 - X_0\|]. $$

The result follows from the estimate $\mu[\|\xi_0\| 1_{\|\xi_0\| > C}] \le \mu[\exp(c\|\xi_0\|)]\, C \exp(-cC)$ for any $c > 0$. □

We now give the proof of the useful Proposition 3.1.
Proof of Proposition 3.1. Let us introduce a parameter $\zeta > 0$; then we have

- 1 hnv p>e [exp (-7 (R - R(e p>e )))] - C = -~ hiTT^ [exp (-7 (R - R(0 p/ ) - ())] 



< ~hnr Pt t [R(0)-R(6 p>1 



<C) 



Then we directly derive from the definition of d Pt £ that 

infc>o{C7 - lu Ttpjg {R(6) - R(9 p/ ) < ()} 



d p / < sup ■ 

7>e 



So 



C7 — dim In 



< dim A -yC(c Pt £ 



>p,i 



In 7 



+ dim In ( — Jb£L y 
\ dim 



Cp,£ 



Cp,i — \\Vp,t\\ J 

Now if $\dim \le \gamma C(c_{p,\ell} - \|\bar\theta_{p,\ell}\|)$ then we get the estimate $\dim(1 + \ln(C c_{p,\ell}\gamma/\dim))/\ln\gamma$ which
decreases with $\gamma$. We then get the desired bound when the supremum is attained for
$\gamma = e \vee \dim/(C(c_{p,\ell} - \|\bar\theta_{p,\ell}\|))$. If $\dim > \gamma C(c_{p,\ell} - \|\bar\theta_{p,\ell}\|)$ then we get the estimate
$(\gamma C(c_{p,\ell} - \|\bar\theta_{p,\ell}\|) + \dim\ln(c_{p,\ell}/(c_{p,\ell} - \|\bar\theta_{p,\ell}\|)))/\ln\gamma$ which increases with $\gamma$. Then we have to consider $\gamma$ as
large as possible, that is when $\dim = \gamma C(c_{p,\ell} - \|\bar\theta_{p,\ell}\|)$, but then we are back to an already
treated case. □

5.5. Proofs of results stated in Subsection 2.4. We first prove the existence of a solution of
the chains with infinite memory (2.4). Then we prove the $\theta_\infty$-weak dependence properties of this
solution when innovations are bounded and of the bounded $\phi$-mixing processes.

Proof of Proposition 2.1. Let us fix some $r \ge 1$. Then $\mu[\|\xi_0\|^r] < \infty$ from Jensen's inequality
as $\xi_0$ admits finite moments of exponential orders. Now we want to apply Theorem 3.1 of [14] to
$F$ that satisfies

$$ \|F(0; \xi_0)\|_r \le u\|\xi_0\|_r < \infty, $$

denoting by $\| \cdot \|_r$ the $L^r$-norm $(\mu[\| \cdot \|^r])^{1/r}$. We also have for any $x = (x_j)_{j \ge 1}$ and $x' = (x'_j)_{j \ge 1}$ the
following relation

$$ \|F(x; \xi_0) - F(x'; \xi_0)\|_r \le \sum_{j=1}^{\infty} a_j(F)\|x_j - x'_j\|. $$

Using assumption (2.6) we obtain the existence of a unique causal stationary solution to equation
(2.4) such that $\pi_0[\|X_0\|^r] < \infty$. Finally this result holds for all $1 \le r < \infty$ and we have proved
the proposition. □

In the sequel, we prove the results of Lemmas 2.2 and 2.3. These two estimates of the $\theta_\infty$-coefficients
are obtained via a common classical technique that we briefly present below, see [11] for more
details. The so-called coupling technique consists in constructing a version $(X_t^*)_{t \in \mathbb{Z}}$ distributed
as $(X_t)_{t \in \mathbb{Z}}$ and such that $(X_t^*)_{t > 0}$ is independent of $\mathfrak{S}_0 = \sigma(X_t, t \le 0)$. If this process $(X_t^*)_{t > 0}$ is
well defined, then it gives sharp estimates of the quantity $\theta_{\infty,n}(1)$ as we have the following version
of the Kantorovitch-Rubinstein duality, see [H] for more details: 

Lemma 5.7. For any version $(X_t^*)_{t \in \mathbb{Z}}$ we have

(5.18) $$ \theta_{\infty,n}(1) \le \sum_{i=1}^{n} \|E(\|X_i - X_i^*\| \mid \mathfrak{S}_0)\|_\infty. $$

For the sake of completeness, we recall the proof of this lemma.

Proof of Lemma 5.7. To compute a bound on the coefficients $\theta_\infty(\mathfrak{S}, Z)$ for this solution we first
need to introduce coupling arguments coming from Dedecker et al. [12] associated with the
coefficients defined as

$$ \tau_\infty(\mathfrak{S}, Z) = \Big\| \sup_{f \in \Lambda_1} \big| E(f(Z) \mid \mathfrak{S}) - E(f(Z)) \big| \Big\|_\infty. $$

First note that this coefficient is in fact the same as $\theta_\infty$. But we prefer this formulation in this
section as it makes apparent the supremum over the class of Lipschitz functions, that is the Wasserstein
metric. If the space is rich enough, we have, as in [12], the Kantorovitch-Rubinstein equation

(5.19) $$ \tau_\infty(\mathfrak{S}, Z) = \inf_{Z^* \in V} \|E(\|Z - Z^*\| \mid \mathfrak{S})\|_\infty $$

where $V$ is the set of the random variables $Z^*$ distributed as $Z$ but independent of $\mathfrak{S}$.

To bound $\tau_{\infty,n}(1)$ for all $n$ we have to consider a version of the whole process $(X_t)_{t \in \mathbb{Z}}$ denoted
$(X_t^*)_{t \in \mathbb{Z}}$ such that $(X_t^*)_{t > 0}$ is independent of $\mathfrak{S}_0 = \sigma(X_t, t \le 0)$. As we equip $\mathcal{X}^n$ with the norm
$\|(x_1, \dots, x_n)\| = \sum_{i=1}^{n} \|x_i\|$ we immediately get the inequality

$$ \tau_{\infty,n}(1) \le \|E(\|(X_1, \dots, X_n) - (X_1^*, \dots, X_n^*)\| \mid \mathfrak{S}_0)\|_\infty \le \sum_{i=1}^{n} \|E(\|X_i - X_i^*\| \mid \mathfrak{S}_0)\|_\infty. $$

□

We conclude the proofs of Lemmas 2.2 and 2.3 by carefully choosing the version in order to get
efficient bounds on $\|E(\|X_i - X_i^*\| \mid \mathfrak{S}_0)\|_\infty$ for all $i$. As there exist plenty of different coupling
schemes in the literature, we now have to choose one that gives a version $(X_t^*)_{t \in \mathbb{Z}}$ with efficient bounds
for each Lemma. For Lemma 2.2 we use the forward coupling, see [23] for more details on this
technique. The maximal coupling of [15] is used for the proof of Lemma 2.3.



Proof of Lemma 2.2. We apply Theorem 3.1 of [14] checking the relations (here $F(0; 0)$ is fixed to
$0$ for convenience)

$$ \|F(0; \xi_0)\|_\infty \le u\|\xi_0\|_\infty < \infty, $$

$$ \|F(x; \xi_0) - F(x'; \xi_0)\|_\infty \le \sum_{j=1}^{\infty} a_j(F)\|x_j - x'_j\|, $$

where $\| \cdot \|_\infty$ denotes the $L^\infty(\mu)$-norm. As (2.6) holds, we can conclude the existence of a unique
causal stationary solution to (2.4) such that $\|X_0\|_\infty < \infty$. Moreover, it follows easily from the
construction that $\|X_0\|_\infty \le u\|\xi_0\|_\infty/(1 - a)$.

Now let us use the coupling Lemma 5.7 on $X_t^*$ that we construct as follows. Let $(\xi_t^*)_{t \in \mathbb{Z}}$ be a
stationary sequence distributed as $(\xi_t)_{t \in \mathbb{Z}}$, independent of $(\xi_t)_{t \le 0}$ and such that $\xi_t^* = \xi_t$ for $t > 0$.
Let $(X_t^*)_{t \in \mathbb{Z}}$ be the solution of the equation

$$ X_t^* = F(X_{t-1}^*, X_{t-2}^*, \dots; \xi_t^*), \quad \text{a.e.} $$

Let $p$ be an integer and $(X_t^{(p)})_{t \in \mathbb{Z}}$ be the solution, bounded by $u\|\xi_0\|_\infty/(1 - a)$, of the equation

(5.20) $$ X_t^{(p)} = F^{(p)}(X_{t-1}^{(p)}, \dots, X_{t-p}^{(p)}; \xi_t), $$

with $F^{(p)}(x_1, \dots, x_p; \xi) = F(x_1, \dots, x_p, 0, \dots; \xi)$ for all $(x_1, \dots, x_p) \in \mathcal{X}^p$. Let $(X_t^{(p)*})_t$ be the
solution of Equation (5.20) with the innovations $(\xi_t^*)_{t \in \mathbb{Z}}$. This coupling scheme is the forward
coupling one for the $(X_t^{(p)})$ for all $p$. We have

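$$ \|E(\|X_r - X_r^*\| \mid \mathfrak{S}_0)\|_\infty \le \|E(\|X_r - X_r^{(p)}\| \mid \mathfrak{S}_0)\|_\infty + \|E(\|X_r^{(p)} - X_r^{(p)*}\| \mid \mathfrak{S}_0)\|_\infty + \|E(\|X_r^* - X_r^{(p)*}\| \mid \mathfrak{S}_0)\|_\infty. $$

Remark that

$$ \|E(\|X_r^{(p)} - X_r^{(p)*}\| \mid \mathfrak{S}_0)\|_\infty \le \sum_{j=1}^{p} a_j(F)\, \|E(\|X_{r-j}^{(p)} - X_{r-j}^{(p)*}\| \mid \mathfrak{S}_0)\|_\infty. $$

Denote $u_t = \sup_{j \ge t} \|E(\|X_j^{(p)} - X_j^{(p)*}\| \mid \mathfrak{S}_0)\|_\infty$ for all $t \in \mathbb{Z}$, which is bounded by $2u\|\xi_0\|_\infty/(1 - a)$
by construction. Moreover for all $n > 0$, $u_n \le a(F)u_{n-p}$ and then $u_n \le a(F)^{[n/p]+1}u_0$. Using that
$u_0 \le 2\|X_0^{(p)}\|_\infty \le 2u\|\xi_0\|_\infty/(1 - a)$ and that $[n/p] + 1 \ge n/p$ we obtain

$$ \|E(\|X_r^{(p)} - X_r^{(p)*}\| \mid \mathfrak{S}_0)\|_\infty \le \frac{2u\|\xi_0\|_\infty}{1 - a}\, a(F)^{r/p}. $$

For the other terms we use the Lipschitz condition on F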

\\E(\\X r -XM\\/M )\\oc < and ||E(||x;-X^l/.Mo)||oo < ^^^(F). 

j= p j= p 

Finally, merging those two bounds we get 



||E(||X r -X r *|||Mo)||oo<2^Mf inf { 

1 — a o<p<'~ 



l{F y/P + Y^a 3 {F) 



3=P 



We conclude by using (5.18) in Lemma 5.7. □

Proof of Lemma \2.3l Here we will consider the maximal coupling scheme of [E]. There exists a 
version (X^)t^z such that 

\\P(X t ^ X* t for some t > r|6 )||oo = sup \F(A/B) - P(B)\ = cp(r). 

(A,B)e6 x5 r 

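Now let us denote by $\mathcal{X}$ the state space of $(X_t)_{t \in \mathbb{Z}}$. As $\|X_0\|_\infty \le C$ we can always fix $\mathcal{X}$ such that
$\|x - y\| \le 2C\, 1_{x \ne y}$ for every $x, y$ in $\mathcal{X}$. Thus we have:

$$ \begin{aligned}
\|E(\|X_i - X_i^*\| \mid \mathfrak{S}_0)\|_\infty
&\le 2C\, \|E(1_{X_i \ne X_i^*} \mid \mathfrak{S}_0)\|_\infty \\
&\le 2C\, \|P(X_i \ne X_i^* \mid \mathfrak{S}_0)\|_\infty \\
&\le 2C\, \varphi(i).
\end{aligned} $$

The last inequality follows from the rough bound

$$ P(X_i \ne X_i^* \mid \mathfrak{S}_0) \le P\Big(\bigcup_{t \ge i} \{X_t \ne X_t^*\} \,\Big|\, \mathfrak{S}_0\Big). $$

We conclude by using (5.18) in Lemma 5.7. □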


5.6. Proofs of results in Section 4. We prove Corollaries 4.3 and 4.4 of Theorem
applied in the context of Neural Networks and of projection predictors on the Fourier basis.

Proof of Proposition 4.3. Firstly we check that all the predictors are $L$-Lipschitz functions of the
observations. For any $x, y \in \mathbb{R}^p$, as the function clip is 1-Lipschitz, we have

$$ \begin{aligned}
|f_\theta(x) - f_\theta(y)| &\le \Big| \sum_{k=1}^{\ell} c_k \big(\phi(a_k \cdot x + b_k) - \phi(a_k \cdot y + b_k)\big) \Big| \\
&\le D_1 \sum_{k=1}^{\ell} |c_k|\, |a_k \cdot (x - y)| \le D_1 \sum_{k=1}^{\ell} |c_k|\, \|a_k\|_1 \|x - y\|_\infty \le D_1 \max_{1 \le k \le \ell} \|a_k\|_1 \sum_{k=1}^{\ell} |c_k| \sum_{i=1}^{p} |x_i - y_i|.
\end{aligned} $$

Then when $\theta$ lies in the compact parameter set defined above, we are sure that $L = D_1(\tau_\ell + (3\ell)^{-1})(C_p + 1/3)$ is a convenient Lipschitz
constant.
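
The Lipschitz bound just derived can be checked on a small example. The sketch assumes the logistic sigmoid, for which $D'_1 = 1/4$ and hence $D_1 = 1 \vee D'_1 = 1$; the parameter values and the function names are arbitrary illustrations, not taken from the paper.

```python
import math

def logistic(u):
    """Logistic sigmoid, an assumed example of phi with D_1 = 1."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def nn_predict(x, a, b, c, c0):
    """f_theta(x) = clip(c0 + sum_k c_k * phi(a_k . x + b_k))."""
    total = c0
    for a_k, b_k, c_k in zip(a, b, c):
        total += c_k * logistic(sum(w * v for w, v in zip(a_k, x)) + b_k)
    return max(-1.0, min(1.0, total))

def lipschitz_bound(a, c, D1=1.0):
    """D_1 * max_k ||a_k||_1 * sum_k |c_k|, the constant appearing in the display above."""
    max_a = max(sum(abs(w) for w in a_k) for a_k in a)
    return D1 * max_a * sum(abs(c_k) for c_k in c)

a = [[1.0, -2.0], [0.5, 0.5]]
b = [0.1, -0.3]
c = [0.4, -0.2]
x, y = [0.2, -0.1], [-0.3, 0.4]
gap = abs(nn_predict(x, a, b, c, 0.0) - nn_predict(y, a, b, c, 0.0))
bound = lipschitz_bound(a, c) * sum(abs(u - v) for u, v in zip(x, y))
```

In this example `gap` is well below `bound`, as the inequality predicts; the bound is usually far from tight because it takes the worst case over all hidden units.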

Secondly we use the approximation estimates given in [6]. More precisely, using Jensen's inequality to
bound the $L^1$-risk by the $L^2$-risk, we know that

. . . . . . -( ',. 

7T0 



med(X | . . . , X_ p ) - / 3 ...,X_ 



VI
Then using Theorem 3.2 there exists some constant $C > 0$ such that for sufficiently large $n$ and
as soon as
(1) 



(5-21) 
we have 



K > 



l + L 



n [R(§)] and 7T < inf <j R {d p/ ) + C I J ^ \n(d p/ n) + " 



p.e 



n 



y/d Pt en\n(d Pt en) 



Then let us remark that (5.21) is always satisfied for sufficiently large $n$ as in fact $L$ goes to $\infty$
with $n$ through $\tau_\ell$ and $\ell$. On the other hand, using the estimate of $d_{p,\ell}$ and the assumption on $C_p$
we know that for $n$ large and for some constant $C$ it holds $d_{p,\ell} \le C p \ell \ln(n)$. Combining with the
approximation bound, it holds



MR0)} and 7r Q [R0)} < inf { vr [\X - med(X \X^, . . .,X- P )\] + ^ + C\ —\n z/2 { 
- ■ y/i V n 



n) 



When fixing i = \J n/pln 3//2 (n) the result follows. □ 
Proof of Proposition 4.4. Let us apply Theorem 3.3; we obtain for some constant $C$ the relation



vr and tt o [R(0)] < inf I R (9 p/ ) + cJ ^ ln(d p/ n) ln( 

I V 71 



<inf^(<W) • r 



n 



ln(d P0) en) ln(ra) 



Now, we have 
R (Opo,t) = j n f [\x p +i - fe Po /X p , ] 



< 



7T0 



Ml 



X 



p+1 



+ inf E 



< 



Ml 



Po n 



j=i j=i 



8=1 



/'li 



7T0 



i=l 



Now, note that the hypothesis on the process implies that $X_1$ has a density upper bounded by
$1/\sqrt{2\pi\sigma^2}$ and so we obtain

-. PO „ n 

R{e PQ ,e) <MM) + ^=infV / I>>i^) 
V27ro- z fee . =1 J 

1 Po / f r n 

- mn) + T^ps? y - 

i — 1 \ i — 1 
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PO sr^po 



< KM) + -== £ ^r" < KM) + ±% 



So now we have 

(5.22) no[R0)} and n Q [R0)] < KM) + mf { r s ^zi, 1 + C\j ^ ln(ri Po/ n) ln(n 

* [ v27rcH V n 

Now, we estimate $d_{p_0,\ell}$ using Proposition 3.1 and we obtain

d P} L=pt(l + hx[L\—\J~ 



We plug it into Equation (5.22) to obtain for some $C > 0$ and sufficiently large $n$

ir o [R0)\ and n o [R0)] < KM) + mf \ r s ^k^ + C l/— In (po^) ln(n) 1 . 

* [ V27r(T 2 V n J 

i 

In particular fixing $\ell$ proportional to $n^{\frac{1}{2s+1}}$ leads to the result. □